University of Wisconsin-Madison

UW High Energy Physics

 

UW-HEP Condor User Info


Q1. I am in possession of a program that requires large amounts of CPU time to run, and I would like to use the Wisconsin grid for running this program. How do I do it?

Some examples of how to run a program under Condor are presented below. First, some general remarks:

If you are a CMS user, you can find more specific information by going here and clicking on "User Documentation".

You will need an account for logging into login.hep.wisc.edu. To request an account, contact help AT hep.wisc.edu.

To tell Condor to run a program, you must first describe the "job" in a submit description file. You then tell Condor to run the job(s) by running condor_submit. Each computer has its own queue of jobs, so if you submit jobs from, say, login02, you must remember to go back to login02 to manage those jobs. You can view the job queue using the condor_q command. If you don't remember where you submitted the job, you can view all of your jobs still in the queue using condor_q -global your-user-name. General advice on submitting and managing Condor jobs can be found in the Condor Manual.

It is best to break up the work you have to do into chunks that take less than 24 hours and more than a few minutes. This reduces time lost to overhead and improves the chances of the job running to completion before being preempted by other higher priority users. See more about preemption below.
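
As a rough illustration of chunking, the following shell sketch works out how many jobs are needed to cover a fixed amount of work and writes a submit file that queues that many jobs, using Condor's $(Process) macro to give each job its own chunk index. The function names, the program name my_program, and the idea of measuring work in "events" are assumptions made for this example.

```shell
#!/bin/bash
# Sketch: split a fixed amount of work into equal-sized chunks, one per job.
# "Events" and all names below are placeholders, not part of any real setup.

# Print the number of chunks needed to cover total_events when each job
# handles events_per_job events (integer division, rounded up).
chunk_count() {
    local total_events=$1 events_per_job=$2
    echo $(( (total_events + events_per_job - 1) / events_per_job ))
}

# Write a submit file that queues one job per chunk. Each job receives its
# chunk index (0, 1, 2, ...) and the chunk size as arguments.
write_chunked_submit_file() {
    local submit_file=$1 total_events=$2 events_per_job=$3
    local n_jobs
    n_jobs=$(chunk_count "$total_events" "$events_per_job")
    cat > "$submit_file" <<EOF
universe = vanilla
executable = my_program
arguments = \$(Process) $events_per_job
output = stdout.\$(Process)
error = stderr.\$(Process)
log = condor_log
queue $n_jobs
EOF
}
```

For example, write_chunked_submit_file chunks.submit 100000 2500 produces a submit file ending in "queue 40". The program would then use its first argument to decide which 2500 events to process. The chunk size should be chosen so that each job runs for more than a few minutes and less than 24 hours.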

Some Condor worker nodes are 32-bit machines. Most are 64-bit. If you compile your program in 32-bit mode, it can run on all Condor worker nodes. The login.hep.wisc.edu machines are 64-bit machines, so by default, compiled programs will be 64-bit. To tell gcc to generate 32-bit programs, use the -m32 option.
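
After compiling with -m32, it can be reassuring to confirm that the result really is a 32-bit executable before submitting it. The sketch below does this by reading the ELF header directly. The function name elf_class is made up for this example, and the check assumes an ordinary ELF binary such as gcc produces on Linux.

```shell
#!/bin/bash
# Sketch: report whether an ELF file is 32-bit or 64-bit.
# Bytes 1-4 of an ELF file are the magic number 0x7f 'E' 'L' 'F';
# byte 5 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit.
elf_class() {
    local path=$1
    local header
    # Dump the first five bytes as hex and strip the spacing,
    # e.g. " 7f 45 4c 46 01" becomes "7f454c4601".
    header=$(od -An -tx1 -N5 "$path" | tr -d ' \n')
    case "$header" in
        7f454c4601) echo 32 ;;
        7f454c4602) echo 64 ;;
        *) echo "not an ELF file: $path" >&2; return 1 ;;
    esac
}
```

Typical use would be to compile with something like gcc -m32 -o my_program my_program.c and then run elf_class my_program, which should print 32.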

By default, Condor will assume that your job requires the same architecture as the submit machine. So if you submit a 32-bit program from a 64-bit machine, it will still only be allowed to run on 64-bit machines. To allow the job to run on both 32-bit and 64-bit machines, override the default by explicitly specifying the architecture requirement in your Condor submit file:

requirements = ARCH == "INTEL" || ARCH == "X86_64"

Similarly, some Condor worker nodes may be running older versions of Linux. If your job is compiled on Scientific Linux 5, for example, it may fail to run on a Scientific Linux 4 machine. Compiling your program on an older version of Linux is one solution. Statically compiling your program is another way to try to make it more portable, though in practice we have found that the program is still not portable if it makes any libc calls that perform DNS lookups or read Unix account information, since these are handled via dynamic library loading.

If the program simply can't run on older versions of Linux, you should specify what version is required. One way to do this is by checking the glibc version.

Example for specifying that Scientific Linux 5 or newer is required:

requirements = TARGET.OSglibc_major == 2 && TARGET.OSglibc_minor >= 5

Note that if you have multiple requirements (such as 64-bit and SL5), you must combine these into a single requirements expression using the && boolean operator.
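
When merging requirements by hand, it is easy to forget that && binds more tightly than ||, so a clause containing || changes meaning unless it is parenthesized. The helper below is a small sketch of one way to do the merge safely. The function name join_requirements is hypothetical and not part of Condor.

```shell
#!/bin/bash
# Sketch: combine several requirement clauses into one "requirements" line.
# Each clause is wrapped in parentheses so that a clause containing ||
# keeps its meaning once it is joined to the others with &&.
join_requirements() {
    local expr="" clause
    for clause in "$@"; do
        if [ -z "$expr" ]; then
            expr="($clause)"
        else
            expr="$expr && ($clause)"
        fi
    done
    printf 'requirements = %s\n' "$expr"
}
```

Called with the two example expressions above as separate quoted arguments, it prints a single line: requirements = (ARCH == "INTEL" || ARCH == "X86_64") && (TARGET.OSglibc_major == 2 && TARGET.OSglibc_minor >= 5)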

Case 1: Use AFS for software and /scratch for data files

In some cases it may be convenient to use AFS for your job's data files. However, there are a number of disadvantages to using AFS in this way, including both performance and security, so we strongly recommend that you put your input/output data files in a directory in /scratch/your-user-name. Your program executables and libraries could also exist in /scratch, but it is usually convenient and reasonable to put these in AFS so that libraries can be easily accessed by the Condor job from wherever it runs.

To grant Condor access to your software, you must make the directories containing the software readable without an AFS token and you must make all parent directories listable without an AFS token. We recommend using the condor-hosts AFS group for this purpose. The following example command can be used to grant access to a sub-directory sw in your AFS home directory:

fs setacl -dir ~ -acl condor-hosts l
find ~/sw -noleaf -type d -exec fs setacl -dir '{}' -acl condor-hosts rl \;

Once you have compiled your program (keeping in mind the 32-bit/64-bit issues mentioned above), and your input files are ready, you can create a submit file describing your job(s) and submit it to Condor. Here is a simple example:

universe = vanilla
executable = /afs/hep.wisc.edu/home/dan/sw/my_program

arguments = arg1 arg2

# Copy environment variables that are set at submit time, such as
# LD_LIBRARY_PATH.
getenv = true

should_transfer_files = yes
when_to_transfer_output = ON_EXIT

output = stdout
error = stderr
log = condor_log

transfer_input_files = inputfile1,inputfile2

# remove the following if you wish to receive email when the job
# completes
notification = never

queue

Once the submit file is ready, you can submit the job to Condor using the following command. Run this command from the directory where you want the output files to go (i.e. in /scratch/your-user-name/...) or explicitly specify an initial working directory in the submit file.

condor_submit submit_file
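
One way to keep each batch of jobs tidy is to give it its own run directory and submit from there. The following is a minimal sketch of that idea. The function name, the SCRATCH_BASE variable, and the directory layout are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: submit a job from a fresh run directory so that its output files
# land there. The layout under SCRATCH_BASE is an assumption.
SCRATCH_BASE="${SCRATCH_BASE:-/scratch/${USER:-$(id -un)}}"

submit_in_run_dir() {
    local run_name=$1 submit_file=$2
    local run_dir="$SCRATCH_BASE/$run_name"
    mkdir -p "$run_dir" || return 1
    cp "$submit_file" "$run_dir/" || return 1
    # Run condor_submit in a subshell so the caller's directory is unchanged.
    ( cd "$run_dir" && condor_submit "$(basename "$submit_file")" )
}
```

For example, submit_in_run_dir run01 submit_file creates the run01 directory under SCRATCH_BASE and submits from it. Relative paths in the submit file, such as those in transfer_input_files, are then resolved against the run directory, so the input files need to be placed there first. Alternatively, an initialdir line in the submit file sets the initial working directory without changing directory before submitting.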

Important events in the life of your job will be logged in the log file specified in the submit file (condor_log in the example above). This includes the time and place where your job began executing and the time when it finished or was preempted by higher priority users on the machine where it was running. You can view the current status of the job in the job queue using condor_q jobid.

Case 2: Use AFS for software and data files

To allow your job to write to an AFS directory, you must give all processes on all Condor worker nodes and submit machines the ability to write to the directory. This is generally not a good thing to do. Don't do things this way unless you have to! Please inform us at condor-help AT hep.wisc.edu before you make heavy use of this option, because it can cause performance problems on the AFS server when many Condor jobs are writing to it at the same time.

The following command can be used to give all Condor machines write access to a directory:

find /path/to/directory -noleaf -type d -exec fs sa -dir '{}' -acl condor-hosts rlidkw \;

When you are done, you should remove the ability of Condor machines to write to the directory. To do that, use the following command:

find /path/to/directory -noleaf -type d -exec fs sa -dir '{}' -acl condor-hosts none \;

Q2. What Condor pools exist at UW Madison?

From hep.wisc.edu, several Condor pools are accessible. You don't have to do anything special to access them. Once jobs are submitted to Condor at hep.wisc.edu, they "flock" to these pools.

  1. condor.hep.wisc.edu: The local HEP Condor pool is just a collection of desktops and a few other machines. If you have a lot of jobs to run, they will generally spill over from this pool into the other larger pools on campus.
  2. glow.cs.wisc.edu: The GLOW Condor pool is composed of machines owned by various big computing users on campus, including CMS, ATLAS, IceCube, and others. Since these machines are owned by specific groups, those groups have immediate priority. Guest jobs will be kicked off whenever the owners have work to do. (See Preemption for more information on dealing with your job getting kicked off.)
  3. cm.chtc.wisc.edu: The CHTC Condor pool is for use by UW Madison researchers.
  4. glidein.chtc.wisc.edu: The dynamic OSG glidein pool for CHTC users. See Open Science Grid for more information.

To view the status of machines in the various Condor pools, use condor_status. For most purposes, such as seeing how much computing power exists, you should filter out special-purpose slots as in the following example queries:

condor_status -pool condor.hep.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool glow.cs.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool cm.chtc.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool glidein.chtc.wisc.edu -const 'IS_MONITOR_VM =!= True'

To see more details about the machines in the pools, you can use the -long or -format options to condor_status. For example, to see what operating system flavors are being run you could use condor_status -long and notice that in the information about each machine, there is an attribute named OSIssue. The following command could then be used to summarize how many slots are running each operating system flavor in the CHTC condor pool:

condor_status -pool cm.chtc.wisc.edu -constraint 'IsGeneralPurposeSlot' \
              -format "%s\n" OSIssue | sort | uniq -c

To additionally see which glibc versions are in use, the following command could be used:

condor_status -pool cm.chtc.wisc.edu -constraint 'IsGeneralPurposeSlot' \
              -format "%s " OSIssue \
              -format "glibc %s." OSglibc_major \
              -format "%s\n" OSglibc_minor \
              | sort | uniq -c
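
To compare pools side by side, the same query can be wrapped in a loop. This sketch covers only the three pools that use the IsGeneralPurposeSlot constraint in the examples above. The function name and the POOLS variable are made up for this example.

```shell
#!/bin/bash
# Sketch: print the operating system breakdown of general-purpose slots
# for each pool in turn.
POOLS="condor.hep.wisc.edu glow.cs.wisc.edu cm.chtc.wisc.edu"

summarize_pools() {
    local pool
    for pool in $POOLS; do
        echo "== $pool =="
        condor_status -pool "$pool" -constraint 'IsGeneralPurposeSlot' \
                      -format "%s\n" OSIssue | sort | uniq -c
    done
}
```

Running summarize_pools on a login machine prints one header line per pool followed by the slot counts for each operating system flavor.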

Q3. Why is my job not running?

There is a tool for analyzing what machines match your job's requirements. Example:

condor_q -better-analyze -pool cm.chtc.wisc.edu jobid

One possible reason for a job not to be running is that Condor hit some error such as a missing input file. In this case, the job will go "on hold", indicated with an 'H' in the status field in the job queue. To remove the job and resubmit it, use condor_rm jobid. If instead you can fix the problem without resubmitting the job, you can release it from hold with condor_release jobid.
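
Before deciding between condor_rm and condor_release, it helps to see why each job was held. The sketch below lists held jobs together with the reason recorded in the job's HoldReason attribute. The function name show_held_jobs is hypothetical.

```shell
#!/bin/bash
# Sketch: list held jobs with the reason Condor recorded for each hold.
# JobStatus 5 is the code Condor uses for the "held" state.
# Any arguments (for example a user name or job id) are passed to condor_q.
show_held_jobs() {
    condor_q "$@" -constraint 'JobStatus == 5' \
             -format "%d." ClusterId \
             -format "%d  " ProcId \
             -format "%s\n" HoldReason
}
```

For example, show_held_jobs your-user-name prints one line per held job, consisting of the job id followed by the hold reason.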

Another reason for a job not to be running is that it ran once and Condor observed that it had a very large virtual image size before the job was preempted. Then future attempts to find a suitable machine may fail if no slots have sufficient memory to match the observed image size. If you have this problem, try to reduce the amount of memory needed by the job. If that is not possible, contact us for additional options.

Q4. My jobs keep getting evicted. What can I do?

In some cases, your job may get kicked off a computer before it finishes. This can happen for two main reasons: the owner of the machine has immediate need for that machine, or another user in the pool with a better fair-share priority has work for the machine to do. In the case of preemption by the machine owner, your job is kicked off immediately. In the case of fair-share prioritization, your job can typically run for up to 24 hours before being killed.

If you are willing to restrict yourself to machines that are not owned by anyone else (i.e. machines provided for all UW researchers), then you can avoid having your job evicted by the machine owner. If you have a relatively small number of jobs, this is a reasonable thing to do. The following may be inserted into your Condor submit file to achieve this. If you already have a requirements line, you must logically merge this one with your other requirements.

+estimated_run_hours = 24
requirements = TARGET.MAX_PREEMPT >= MY.estimated_run_hours*3600

Alternatively, you can try to be opportunistic and get work out of the computers that are owned by other people. When your job is preempted by someone else, it returns to the idle state in the job queue and will try to run again. It is possible to make a job save state when it is kicked off so that it can resume from where it left off. Otherwise, it must restart from the beginning.

One way to make it save state is to use Condor's "standard universe". This requires relinking your program with Condor's standard library. Not all programs are compatible with this (e.g. multi-threaded or dynamically linked programs). For more information, see the Condor Manual.

Another option is to have your job intercept the kill signal (SIGTERM) sent by Condor when it wishes to kick the job off the machine. It should then quickly write out whatever information it needs in order to resume from where it left off. If it doesn't shut down within the grace period (typically 10 minutes), it will be hard-killed with SIGKILL. To tell Condor to save the intermediate files that your program has generated in the working directory on the worker node, use the following option in the submit description file:

when_to_transfer_output = ON_EXIT_OR_EVICT
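
The signal-handling approach might look roughly like the following when the job is a shell script. This is only a sketch under simple assumptions: the work can be divided into short, numbered units, and a single number is enough to describe the progress made. The file name checkpoint.txt and all function names are placeholders.

```shell
#!/bin/bash
# Sketch: a job that saves its progress when it receives SIGTERM and
# resumes from that point when it is restarted. All names are placeholders.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.txt}"
step=0

# Stand-in for one unit of real work. Keep each unit short: bash runs the
# trap handler only after the command currently executing has returned.
do_one_unit() {
    sleep 1
}

# Signal handler: record how many units are complete, then exit.
save_and_exit() {
    echo "$step" > "$CHECKPOINT_FILE"
    exit 143   # 128 + 15, the conventional status for ending on SIGTERM
}

run_job() {
    local target=$1
    # Resume from the checkpoint if an earlier attempt left one behind.
    if [ -f "$CHECKPOINT_FILE" ]; then
        step=$(cat "$CHECKPOINT_FILE")
    fi
    trap save_and_exit TERM
    while [ "$step" -lt "$target" ]; do
        do_one_unit "$step"
        step=$((step + 1))
    done
    rm -f "$CHECKPOINT_FILE"   # finished: nothing left to resume
    trap - TERM
}
```

A real job script would end with a call such as run_job 1000. Because the checkpoint file is written in the job's working directory, it is one of the intermediate files that the option above asks Condor to save. A unit that was in progress when the signal arrived is simply repeated on the next attempt.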

The Open Science Grid (OSG)

The UW is part of the Open Science Grid. This means that you can use computers at many other campuses when those computers are available for opportunistic use. For the right type of job, this can add up to a lot of additional computing power.

The mechanism that we use to access the OSG is called glideinWMS. When jobs are submitted that express a desire to run on the OSG, this causes the UW's glidein.chtc.wisc.edu Condor pool to be dynamically expanded as computers from the OSG are made available.

Requirements for a job to run on OSG:

  • The job must be submitted from login.hep.wisc.edu or submit.chtc.wisc.edu. If you would like additional submit machines to be supported, please let us know.
  • The job must define WantGlidein=true. To do this, insert the following line in the job's submit file:
    +WantGlidein = true
    

    Note that the '+' is a required part of the syntax.

  • The job must be a vanilla universe job (the default). Standard universe is not currently supported, due to firewalls that exist at many OSG sites.
  • The job must be entirely self-contained. For example, it must not depend on access to AFS, because the computers at other campuses often do not support AFS.
  • The job should make minimal assumptions about what shared libraries and other programs are available. A variety of Linux versions exist on the OSG. It is best to ship all libraries with the job (or statically compile). As of 2012-02-22, most machines in OSG are compatible with Scientific Linux 5.
  • The job should not need to run for long periods of time to get useful work done. A job that runs for 2 hours or less is ideal. A job that runs for more than a day must specify estimated_run_hours, which may limit the availability of computers that it has access to. If it does not specify estimated_run_hours, the job will likely get interrupted before it finishes, which will cause the job to return to the idle state and start over from the beginning in the next attempt. To set this parameter, put the following in your submit file, adjusting the number of hours to be appropriate for your job:
    +estimated_run_hours = 36
    
  • The job should ideally use less than 2GB of RAM. If it needs more, request_memory must be set to the required amount of memory (in MB). This may reduce the number of computers available to run the job, but it is important to avoid running on computers with too little memory. Example:
    request_memory = 4000
    

    Note that a + should not precede the setting of request_memory, because this is a built-in command recognized by condor_submit.

    Another way to get more memory for the job is to run in whole-machine mode. This is described below.

  • The job should ideally only keep one CPU core busy. If the job needs all of the CPU power and/or memory of the machine, it must run in whole-machine mode. To do that, the job must be submitted with the following in the submit file. (Merge the requirements with your existing requirements, if any.)
    +RequiresWholeMachine = True
    requirements = CAN_RUN_WHOLE_MACHINE
    

    Note that there may not be as many computers available to run whole-machine jobs as single-core jobs.

  • Some OSG sites require that jobs be submitted with x509 credentials. Therefore, if you wish to have access to more resources, you will need to get an x509 certificate that is registered as a member of the GLOW Virtual Organization. To do that, email help AT hep.wisc.edu.
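
Putting several of these requirements together, a submit file for an OSG-capable job could be generated along the following lines. This is a sketch rather than a recommended template: the function name, my_program, and the input file names are placeholders, and the run-time and memory lines should be left out when the job does not need them.

```shell
#!/bin/bash
# Sketch: write a submit file for a self-contained job that may run on OSG.
# my_program and the input file names are placeholders.
write_osg_submit_file() {
    local submit_file=$1 run_hours=$2 memory_mb=$3
    cat > "$submit_file" <<EOF
universe = vanilla
executable = my_program
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
transfer_input_files = inputfile1,inputfile2
output = stdout
error = stderr
log = condor_log

# Opt in to running on OSG glideins.
+WantGlidein = true
# Only needed for jobs expected to run for more than a day.
+estimated_run_hours = $run_hours
# Only needed for jobs that use more than 2GB; the value is in MB.
request_memory = $memory_mb

queue
EOF
}
```

For example, write_osg_submit_file osg.submit 36 4000 writes a submit file for a job expected to run for 36 hours and use up to 4000 MB of memory. The executable and input files here are transferred with the job rather than read from AFS, in keeping with the requirement that the job be self-contained.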

Because of the flexibility of Condor flocking, a batch of identically submitted jobs may use some computers in UW Condor pools as well as OSG computers when and if they become available. Condor will try to find a machine for your job at UW before it attempts to run it on OSG. Therefore, if your job is suitable for running in OSG, there are few disadvantages to doing so. The two main costs to consider are:

  1. Your job may run on computers that are configured differently from UW computers; if this causes problems, you may have to spend a little more time debugging.
  2. Your job may get preempted (killed) at any time when there is higher priority work at the OSG site. When this happens, the job will remain in the queue and will be scheduled to run again when a computer becomes available.

Although normally it is desirable to let jobs run at UW in addition to OSG, for testing, you may wish to submit jobs that will only run in OSG. The following requirements expression can be used to do that. If you already have a requirements expression in your submit file, you will need to logically AND this expression with your existing one.

requirements = IS_GLIDEIN

Getting Help

  • Please contact help AT hep.wisc.edu if you have trouble and need further assistance.

 

 
 