Hacks for Globus Jobmanager for Condor

Author: Dan Bradley
Last Updated: 2007-11-19

This is a collection of useful patches that I happen to be aware of for condor.pm, the Globus jobmanager for Condor. This file is found in $OSG_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/.

NFSLite

The most famous condor.pm hack in OSG is NFSLite, developed by Terrence Martin at UCSD. It is a relatively small patch that turns on Condor file transfer mode in order to reduce use of the NFS server. The standard input/output, user proxy, and files in the job's GRAM scratch directory are copied to/from the job's temporary scratch directory on the worker node.

NFSLite is currently available as a VDT package. There is further documentation here.

Job Wrapper for OSG Jobs

Why would you want to have a wrapper script start the user job? One reason is to have the environment variable OSG_WN_TMP set equal to the value of _CONDOR_SCRATCH_DIR. Then when the job runs, it can do its scratch work in the temporary directory created by Condor for the job. The advantage of this is that Condor automatically cleans up the contents of this directory if the job leaves anything behind.
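As a minimal sketch of that idea (this is not the actual osg_job_wrapper script), a wrapper could copy _CONDOR_SCRATCH_DIR into OSG_WN_TMP and then exec the real job. The helper name wrapper_env, and the convention that the user's command arrives as the wrapper's arguments, are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the given environment in which
# OSG_WN_TMP points at Condor's per-job scratch directory, if one is set.
sub wrapper_env {
    my (%env) = @_;
    if (defined $env{_CONDOR_SCRATCH_DIR} && length $env{_CONDOR_SCRATCH_DIR}) {
        $env{OSG_WN_TMP} = $env{_CONDOR_SCRATCH_DIR};
    }
    return %env;
}

# When run directly with arguments, adjust the environment and exec the
# user's command (assumed to be passed as the wrapper's argument list).
if (!caller() && @ARGV) {
    %ENV = wrapper_env(%ENV);
    exec { $ARGV[0] } @ARGV or die "cannot exec $ARGV[0]: $!\n";
}
```

If _CONDOR_SCRATCH_DIR is not set, the sketch leaves any existing OSG_WN_TMP value untouched, so the wrapper is harmless outside of Condor.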

One way to achieve this is to configure the worker nodes with a USER_JOB_WRAPPER. However, in Wisconsin, we flock OSG jobs to several Condor pools, and we don't want to make OSG-specific modifications to the configuration of these other nodes if at all possible. Therefore, this small hack to condor.pm runs a wrapper script that requires no configuration or file installation on the worker nodes.

The wrapper script itself is here. The modification to condor.pm should be inserted in the section of condor.pm where the Condor submit file is being created. I put it after the line that sets X509UserProxy.

#Find out if user is overriding universe.
my $real_universe = $universe;
if(defined($submit_attrs[0])) {
    foreach $tuple (@submit_attrs) {
        if( lc(@$tuple[0]) eq "universe" ) {
            $real_universe = @$tuple[1];
        }
    }
}
if( lc($real_universe) ne "standard" ) {
    #Turn on file-transfer mode
    print SCRIPT_FILE "ShouldTransferFiles = true\n";
    print SCRIPT_FILE "WhenToTransferOutput = ON_EXIT\n";
    print SCRIPT_FILE "remote_initialdir = " . $description->directory() . "\n";

    #Replace the user's executable with a custom wrapper script.
    print SCRIPT_FILE "Executable = $ENV{GLOBUS_LOCATION}/../osg_job_wrapper/osg_job_wrapper\n";

    #If the wrapper exits with SIGUSR1, this indicates a transient error,
    #such as a problem downloading the user's executable. Requeue the
    #job in this case.
    print SCRIPT_FILE "on_exit_remove = (ExitBySignal == False) || (ExitSignal != 10)\n";

    #Place the gridftp URL of the user's executable in the environment,
    #so the wrapper script can download it.
    require POSIX;
    my $hostname = (POSIX::uname())[1];
    my $url = "gsiftp://$hostname:2811" . $description->executable();
    $environment_string = $environment_string . ";OSG_JOB_WRAPPER_CMD_URL=$url";

    #Remove the X509_USER_PROXY setting from the environment, so
    #the Condor jobmanager automatically sets it to the correct
    #path in the remote scratch directory.
    $environment_string =~ s|X509_USER_PROXY=([^;]*);*|;|g;
}
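The universe check at the top of the patch can be isolated as a small function, which makes the rule easier to see and to test outside of condor.pm. This is only a sketch: effective_universe and wants_file_transfer are hypothetical names, and the code assumes, as the patch does, that each submit attribute is a two-element array reference holding a name and a value.

```perl
use strict;
use warnings;

# Hypothetical helper: return the universe the job will actually use.
# $default is the jobmanager's own choice; $attrs is a reference to a list
# of [name, value] pairs supplied by the user (assumed layout).
sub effective_universe {
    my ($default, $attrs) = @_;
    my $universe = $default;
    foreach my $tuple (@{ $attrs || [] }) {
        # Attribute names are compared case-insensitively; the last match wins.
        if (lc($tuple->[0]) eq "universe") {
            $universe = $tuple->[1];
        }
    }
    return $universe;
}

# The patch only turns on file-transfer mode outside the standard universe.
sub wants_file_transfer {
    my ($default, $attrs) = @_;
    return lc(effective_universe($default, $attrs)) ne "standard";
}
```

For example, wants_file_transfer("vanilla", [["Universe", "Standard"]]) is false, so a job that overrides the universe to standard would be left alone.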

Adding OSG VO information to the job ClassAd

Having OSG VO information in the job ClassAd is useful in a number of ways. For example, you can write machine RANK expressions that favor some OSG VOs over others.

Here are the lines to add to condor.pm. I put them after the line that sets X509UserProxy.

#Insert OSG_VO into job ClassAd
my $username = getpwuid($<);
local(*VO_MAP_FILE);
if(open(VO_MAP_FILE, "<$ENV{GLOBUS_LOCATION}/../monitoring/grid3-user-vo-map.txt")) {
    my @vo_map_lines = grep(/^${username} /,<VO_MAP_FILE>);
    if( (scalar @vo_map_lines) == 1 ) {
        my $osg_vo;
        ($username,$osg_vo) = split(" ",$vo_map_lines[0]);
        print SCRIPT_FILE "+OSG_VO = \"${osg_vo}\"\n";
    }
    close(VO_MAP_FILE);
}
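The lookup logic can also be tried on its own, away from the map file and the submit file handle. The sketch below is illustrative only: lookup_vo and osg_vo_submit_line are hypothetical names, and the map format (a username followed by a space and the VO name on each line) is assumed from what the patch's grep and split expect. Unlike the patch, it quotes the username with \Q...\E so that any regex metacharacters in it are matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: given a username and the lines of a user-to-VO map
# (assumed format: "username vo"), return the VO name, or undef unless
# exactly one line matches that username.
sub lookup_vo {
    my ($username, @lines) = @_;
    my @matches = grep { /^\Q$username\E / } @lines;
    return unless @matches == 1;
    my (undef, $vo) = split(" ", $matches[0]);
    return $vo;
}

# Build the submit-file line that adds the VO to the job ClassAd.
sub osg_vo_submit_line {
    my ($vo) = @_;
    return "+OSG_VO = \"$vo\"\n";
}
```

As in the patch, a username that appears zero times or more than once yields no OSG_VO attribute rather than a guess.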

Making jobs run on both 32 and 64-bit nodes

By default, the Globus jobmanager for Condor inserts a requirement that restricts each job to a single architecture, so a job cannot run on both 32 and 64-bit systems. A simple modification will allow jobs to run on both types. This assumes that all jobs are actually 32-bit and that all 64-bit systems have the necessary 32-bit compatibility libraries.

The modification is to comment out the following line:

$requirements .= " && Arch == \"" . $description->condor_arch() . "\" ";