Wisconsin CMS Tier-2 FAQ


FAQ and Troubleshooting

dCache Data Storage
The "farmout" Scripts
Condor Batch System
General

dCache Data Storage


How can I get a copy of the data files?

You must first find out the path to your files in UW-HEP dCache.

To copy all of the files in a dataset in bulk, it is easiest to use dccp from the command line. Log into a server where you have some scratch space to store the files (e.g. sesame). You may need to go through login.hep.wisc.edu if you are outside the hep.wisc.edu domain. Next, find the directory containing your files in /pnfs/hep.wisc.edu/.../dataset. (You will need to be on a computer with /pnfs/hep.wisc.edu mounted to do that.) Then copy the files like this:

source /afs/hep.wisc.edu/cms/cmsprod/setup.sh
dccp_many -r /pnfs/hep.wisc.edu/.../dataset /data/mydata/

dccp_many is simply a script that calls dccp for each file in a list of files, since dccp itself will only copy one file at a time.
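
The per-file loop described above can be pictured roughly as follows. This is only a sketch, not the actual dccp_many script: copy_each and the COPY_CMD variable are hypothetical names introduced here for illustration, and the function assumes one source path per line on standard input.

```shell
# Hypothetical stand-in for the loop inside dccp_many (not the real script).
# Reads one source path per line from stdin and copies each into a destination.
# COPY_CMD defaults to dccp; set COPY_CMD=echo for a dry run.
copy_each() {
  dest=$1
  status=0
  while IFS= read -r src; do
    [ -n "$src" ] || continue                 # skip blank lines
    # </dev/null keeps the copy command from consuming the list on stdin
    ${COPY_CMD:-dccp} "$src" "$dest/" </dev/null || status=1
  done
  return "$status"                            # non-zero if any copy failed
}
```

For a dry run that only prints what would be copied, something like `COPY_CMD=echo copy_each /data/mydata < filenames` should work, where "filenames" is a file holding one /pnfs path per line.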


Can I access data files from outside the hep.wisc.edu domain?

Yes. All data files are globally readable from dCache via several protocols: dcap, http, gsiftp, and srm.

If you want to get a local copy of data files from UW-HEP, you just need a list of filenames. Once you have that, you may copy them. For example, to get the files via http, assuming you have the /pnfs paths stored in a file named "filenames":

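# run this in bash if you normally use some other shell
cd /local/path/to/store/files
# read one /pnfs path per line; quoting keeps unusual characters in a path intact
while IFS= read -r file; do
  wget "http://cmsdcap.hep.wisc.edu:2244$file"
done < filenames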

You can also retrieve the files via the dcap protocol. For example:

dccp dcap://cmsdcap.hep.wisc.edu:22125/pnfs/hep.wisc.edu/cmsprod/test_file .

Yet another way is to use the SRM protocol. Example:

srmcp -debug=true srm://cmssrm.hep.wisc.edu:8443//pnfs/hep.wisc.edu/cmsprod/test_file file://localhost//tmp/test_file
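
The http, dcap, and srm examples above differ only in scheme, host, and port. As a minimal sketch, the helper below builds each URL form from a /pnfs path, using the host names and ports quoted in the examples above. The function name pnfs_url is hypothetical and is not one of the site tools. gsiftp is left out because no host or port for it is given here.

```shell
# Hypothetical helper: turn a /pnfs path into a URL for one of the
# protocols shown above. Hosts and ports are copied from the examples.
pnfs_url() {
  proto=$1
  path=$2
  case "$proto" in
    http) printf 'http://cmsdcap.hep.wisc.edu:2244%s\n' "$path" ;;
    dcap) printf 'dcap://cmsdcap.hep.wisc.edu:22125%s\n' "$path" ;;
    # the srm example above puts an extra "/" between the port and the path
    srm)  printf 'srm://cmssrm.hep.wisc.edu:8443/%s\n' "$path" ;;
    *)    echo "unknown protocol: $proto" >&2; return 1 ;;
  esac
}
```

With this defined, the dcap example above could be written as `dccp "$(pnfs_url dcap /pnfs/hep.wisc.edu/cmsprod/test_file)" .`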

How can I manage my files in dCache?

On machines with /pnfs mounted (e.g. login machines), you should be able to see files in dCache with ls. However, it is usually the case that your unix user id does not match the unix user id that owns the files in dCache, so you have to go through an additional step in order to do other operations, such as rm, mv, mkdir, rmdir.

Ideally, these operations could be done via SRM commands. However, the current version of SRM does not allow this. Therefore, the best solution is to submit file management commands through the grid. Your grid credentials should map you to the same user who owns the files in dCache, so you should be able to do any namespace operations (such as rm, mv, mkdir, etc.).

For interactive file management, the simplest solution is to run an xterm on the grid gatekeeper. Here is a script that does this for you:

/afs/hep.wisc.edu/cms/cmsprod/bin/grid-xterm-uwhep

An xterm should eventually pop up on your screen and you should find that you are at a shell prompt as the user that your grid credentials are mapped to. You can then cd into the pnfs directory where you want to work and start running commands.

If you get errors about not being able to open the display, then you need to make sure your ssh client is forwarding an X session when you log in. For example, on a Mac running OS X, you need to do something like the following:

  1. start up X11.app
  2. from the X11 xterm, connect to UW-HEP:
    ssh -X user@login.hep.wisc.edu
    
  3. grid-proxy-init
  4. grid-xterm-uwhep
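
Before starting grid-xterm-uwhep, you can check whether X forwarding is active by testing the DISPLAY variable, which ssh sets on the remote side when forwarding succeeds. A minimal sketch follows. The name have_display is a hypothetical helper introduced here, not a site tool.

```shell
# Hypothetical pre-flight check: succeed only if an X display is configured.
have_display() {
  if [ -n "${DISPLAY:-}" ]; then
    return 0
  fi
  echo "DISPLAY is not set; reconnect with: ssh -X user@login.hep.wisc.edu" >&2
  return 1
}
```

A typical use would be `have_display && grid-xterm-uwhep`, so the xterm is only requested when a display is available.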

Access to some files seems to take forever. How can I tell if dCache is functioning properly?

Normally, our dCache service functions very well. However, we are still working on improving the service to overcome occasional difficulties. You can see the list of active transfers in the dCache server: cms-dcache.hep.wisc.edu:2288. In the far right column, you can see if dCache is having trouble accessing the file if it says "Staging" or "No Mover Found" instead of showing a transfer speed. These messages are expected for short periods of time in a heavily loaded system, but they should go away after a few minutes.

You can also test the ability to access individual files using dccp or any of the other file transfer mechanisms. See "How can I get a copy of the data files?" above.

The files being transferred are reported by pnfsid. If you need to find out what the filename is, you may use the following command if you are in the CMS AFS group and on any machine with /pnfs/hep.wisc.edu mounted.

source /cms/cmsprod/setup.sh #initialize your environment
dcache_pnfs_pathfinder pnfsid

The files in dCache may also be accessed through xrootd, which offers very high throughput when this is required (e.g. serving pileup data for digitization at high luminosity).


How can I open dCache files directly from root?

If you know the /pnfs path to a file, you can have root read from the file directly by prepending dcap://cmsdcap.hep.wisc.edu:22125 to the file name that you give to root.


How can I see the true size of files greater than 2GB?

Files in dCache larger than 2GB appear in /pnfs with size 1. This is due to a limitation of the NFS protocol. To see the real size, you can use srm-get-metadata. Example:

$ srm-get-metadata srm://cmssrm.hep.wisc.edu:8443//pnfs/hep.wisc.edu/store/user/mbanderson/phtn_jets_170_300-DIGIL1RECO/DIGIL1RECO-765D5386-C1B8-DB11-B23C-00123F207FEC.root
FileMetaData(srm://cmssrm.hep.wisc.edu:8443//pnfs/hep.wisc.edu/store/user/mbanderson/phtn_jets_170_300-DIGIL1RECO/DIGIL1RECO-765D5386-C1B8-DB11-B23C-00123F207FEC.root)=
RequestFileStatus
SURL :srm://cmssrm.hep.wisc.edu:8443//pnfs/hep.wisc.edu/store/user/mbanderson/phtn_jets_170_300-DIGIL1RECO/DIGIL1RECO-765D5386-C1B8-DB11-B23C-00123F207FEC.root
size :7090303926
owner :622
group :622
permMode :420
checksumType :adler32
checksumValue :bd7f6d8f
isPinned :false
isPermanent :true
isCached :true
state :
fileId :0
TURL :
estSecondsToStart :0
sourceFilename :
destFilename :
queueOrder :0
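
If you only need the size, it can be filtered out of the metadata. The sketch below assumes the output contains a "size :<bytes>" field exactly as in the example above. The function name srm_size is hypothetical.

```shell
# Hypothetical filter: read srm-get-metadata output on stdin and print
# the value of the "size :<bytes>" field (first match only).
srm_size() {
  sed -n 's/.*size[[:space:]]*:\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

Usage would look like `srm-get-metadata srm://cmssrm.hep.wisc.edu:8443//pnfs/hep.wisc.edu/... | srm_size`, with the elided part replaced by the path to your file.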

The "farmout" Scripts


How can I use farmoutRandomSeedJobs to submit CMSSW MC jobs to Condor?

farmoutRandomSeedJobs jobName nEvents nEventsPerJob /path/to/CMSSW /path/to/configTemplate

There is an example configuration template here. Use the --help option to see all of the options.


How can I use farmoutAnalysisJobs to submit CMSSW analysis jobs to Condor?

This script runs cmsRun on the root files in a directory or directory tree. By default, it runs on all root files in a directory in your /pnfs area, using the jobName that you specify to find the files. However, you can direct it to an alternate path and tell it to exclude root files whose names match a pattern that you specify.

For full options to the script, use the -h option. Here is a brief synopsis:

farmoutAnalysisJobs [options] jobName /path/to/CMSSW /path/to/configTemplate

There is an example configuration template here.


How can I use mergeFiles to merge together analysis output?

mergeFiles [options] output_file.root input_directory(s)

Use mergeFiles -h for a full list of options.

Condor Batch System


How do I submit a generic job to Condor at the Wisconsin Tier-2?

You can submit your job from your working directory in AFS, but it is preferable to submit from a local disk, such as /scratch/username. If you don't explicitly provide the names of the input/output directories, then your submit directory will be assumed for all input/output operations involving relative paths.

If your output files are being written into AFS, you must make the directory writable by any process running on the machines where condor runs. This should only be done if absolutely necessary. AFS performance may suffer if hundreds of condor jobs all pound on it at the same time. This is also dangerous from a security standpoint, so do not do this on directories containing executables etc. Certainly do not do it on your home directory.

If you really must write to AFS from your Condor jobs, here is how you must prepare the AFS directory:

mkdir /path/to/data
fs setacl -dir /path/to/data -acl condor-hosts rlidkw

If you are not using AFS to write output, you must enable Condor's file-transfer mechanism as in the example below.

For full details on how to submit jobs to Condor, see the Condor Manual or the Quick Start. Here is a simple example of a submit description file that you could use to submit a job from one of the login machines at the Wisconsin Tier-2:

# path to the program to run, e.g. cmsRun
Executable = /path/to/your/executable
Arguments = arg1 arg2 ...
GetEnv = true
Universe = Vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = inputfile1 inputfile2 ...
output = job.out
error = job.err
log = job.log
notification = never
on_exit_remove = (ExitBySignal == FALSE && ExitStatus == 0)
ImageSize            = 900000
+DiskUsage           = 2000000
Requirements = TARGET.HasAfs =?= True
Queue
(What is the meaning of the above variables and what do they do? See Explanation.)

  • Then, submit the job to Condor using the command: condor_submit job_description_file
  • (How do I submit a condor job? See Examples)
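
If you submit many similar jobs, the fixed part of a submit description can be generated by a small shell function. The sketch below assumes that only the executable and its arguments vary between jobs. write_submit_description and job.sub are hypothetical names, and the keys are a subset of those in the example above, so add GetEnv, Requirements, and the other settings as your job needs them.

```shell
# Hypothetical generator: print a minimal submit description for one job.
# Usage: write_submit_description /path/to/executable [args...]
write_submit_description() {
  exe=$1
  shift
  cat <<EOF
Executable = $exe
Arguments = $*
Universe = Vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = job.out
error = job.err
log = job.log
notification = never
Queue
EOF
}
```

For example, `write_submit_description /path/to/your/executable arg1 arg2 > job.sub` followed by `condor_submit job.sub`.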

    My Condor jobs keep getting preempted by users with higher priority.

    Condor uses a fair sharing algorithm to distribute resources. Users who claim lots of resources gradually get less priority, so that others do not get starved for resources. In special cases, we may need to adjust priorities in order to get important work done on schedule.

    Since your jobs may run anywhere on the Madison campus Condor grid, your jobs may also be landing in "unfriendly" territory where they are likely to be preempted after a short amount of time. If your job needs a minimum of X time in order to get anything done and you don't want to have it try to run on resources that can't guarantee that amount of uninterrupted time, then you can specify this in the requirements expression. Example:

    requirements = (TARGET.MaxJobRetirementTime >= X)
    

    where X is the number of seconds of runtime that your job requires. Just be careful not to set this too high or you may not find any matching resources. A reasonable value is one or two days. You can use condor_q -analyze on your jobs to see if there are matching resources.
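
Since X is given in seconds, a value of one or two days has to be converted first (one day is 24 * 60 * 60 = 86400 seconds). As a small worked example, the hypothetical helper below (retirement_requirement is a name introduced here) prints the requirements line for a given number of days.

```shell
# Hypothetical helper: print a requirements line asking for at least
# the given number of days of uninterrupted runtime.
retirement_requirement() {
  days=$1
  seconds=$((days * 24 * 60 * 60))   # convert days to seconds
  printf 'requirements = (TARGET.MaxJobRetirementTime >= %s)\n' "$seconds"
}

retirement_requirement 2
# prints: requirements = (TARGET.MaxJobRetirementTime >= 172800)
```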


    Why are my jobs idle?

    Jobs submitted to Condor at the Wisconsin Tier-2 may run on resources distributed across the campus grid. It can take a few minutes for the Condor negotiator to come around to your newly submitted job and try finding a machine to run it on. If no machines are immediately available, the job waits in the idle state ('I' in the condor_q output).

    To see how many machines could possibly run your job, you can use the following command:

    
    condor_q -pool glow.cs.wisc.edu -analyze <jobid>
    
    

    If your job requirements do not match very many machines, you can try to analyze the requirements:

    
    condor_q -pool glow.cs.wisc.edu -better-analyze <jobid>
    
    

    It may happen that your urgent jobs have no problem matching the requirements of lots of machines, but they are still idle due to machines being busy with other jobs. In this case, let us know and we can see if a priority adjustment would help.


    What does this WARNING (File /afs/blah/blah.out is not writable by condor) mean?

    WARNING: File /afs/hep.wisc.edu/user/blah/blah.out is not writable
    by condor.
    
    WARNING: File /afs/hep.wisc.edu/user/blah/blah.error is not
    writable by condor.
    
    These warnings indicate that the directory "blah" does not grant write permission to condor-hosts. You really should avoid submitting jobs from AFS if at all possible. If you really must submit from AFS, see the recipe for setting up the ACLs on the AFS directory under "How do I submit a generic job to Condor at the Wisconsin Tier-2?" above.

    General


    How can I ssh from Wisconsin Tier-2 to Fermilab?

    Fermilab uses kerberos 5 to authenticate users. The default ssh client at the Wisconsin Tier-2 is only able to handle kerberos 4. However, a kerberos 5 enabled version of the ssh client is provided. Example:

    kinit fnal-username@FNAL.GOV
    ssh-krb5 -2 fnal-username@cmsuaf.fnal.gov
    

    Once you are connected, you will find that you have no AFS token or other kerberos credential at Fermilab. Running kinit above with the -f option makes your ticket forwardable, so your credential is forwarded when you connect to some Fermilab computers. On others, however, the login attempt can hang. In that case, rather than using a forwardable kerberos ticket, you may just have to authenticate again (but this time from Fermilab):

    kinit fnal-username@FNAL.GOV
    

    Contacts For Help

    Email: tier2-support@hep.wisc.edu