CVMFS

We use CVMFS to implement OSG_APP, which is a directory where OSG Virtual Organizations install applications at our site. Since nearly all running applications rely on this installation, it must be reliable and performant. CVMFS provides high reliability and performance by making use of HTTP and local caches.

Service architecture

Clients access the cms.hep.wisc.edu CVMFS namespace hosted on cvmfs01.hep.wisc.edu through two Squid caches (frontier01.hep.wisc.edu, frontier02.hep.wisc.edu). Virtual Organizations write into the namespace by submitting jobs to our gatekeeper osggrid01.hep.wisc.edu. From their point of view, it should behave like OSG_APP at any other site. They do not need to do anything CVMFS-specific.

The CVMFS-writer node has a read-write NFS mount of the writeable file tree that gets published into CVMFS. It has a read-only mount of CVMFS in /cvmfs/pub. All other nodes (worker nodes, interactive machines) have the usual read-only mounts in /cvmfs, using the CVMFS FUSE module.
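
As a rough illustration, such a writer node might be configured along these lines (the hostname and export path are taken from the recovery procedures later on this page; the mount point and options are illustrative):

# /etc/fstab sketch: read-write NFS mount of the writeable tree that gets published into CVMFS
cvmfs01.hep.wisc.edu:/srv/cvmfs/cms.hep.wisc.edu/export-shadow  /cvmfs  nfs  rw,hard,intr  0 0

# the read-only CVMFS view of the published repository is mounted in /cvmfs/pub by the
# regular CVMFS client (autofs/FUSE), just as /cvmfs is mounted on any other node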

OASIS

The OSG OASIS project is a new service under development in OSG. It basically does what we have done with OSG_APP, but it is intended to be hosted centrally and mounted by many OSG sites. Eventually, it may be possible to phase out OSG_APP altogether and replace it with OASIS.

Configuration

CFEngine installs the CVMFS configuration files on the server and on all clients in the HEP domain. Clients in other campus domains need to install similar configuration files in order to access the filesystem. A cron job runs periodically on the CVMFS server to create subcatalogs and publish changes. Publication uses rsync to copy updates from the NFS-exported area written to by CVMFS writers into the CVMFS “shadow tree”. When the rsync is finished, the CVMFS server tool is used to update the files and catalogs published by the web server. We experimented with having writers write directly to the shadow tree, but this was problematic because all writes are blocked during publication, which produced long delays or I/O errors for the writers.
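
For reference, the cron entry amounts to periodically running the site publication script that also appears in the recovery procedures below; a sketch (the schedule shown here is illustrative):

# hypothetical /etc/cron.d entry on the CVMFS server; adjust the schedule to taste
0 * * * * root /etc/cvmfs/cvmfs-publish /srv/cvmfs/cms.hep.wisc.edu/export-shadow/ /srv/cvmfs/cms.hep.wisc.edu/shadow/ cmsops@hep.wisc.edu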

Setting up a CVMFS client

You are welcome to use our cms.hep.wisc.edu repository, which contains our $OSG_APP. The following instructions are based on the official CVMFS documentation (PDF). The instructions assume that you are running RHEL/CentOS/SL 5.x or later, have access to a local HTTP caching proxy server (such as Squid or CMS Frontier), and have administrator privileges.

First, install the CernVM yum repository and GPG key, then the CVMFS packages:

wget http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo
sudo mv cernvm.repo /etc/yum.repos.d/cernvm.repo

wget http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM
# verify the downloaded key before installing it
sudo mv RPM-GPG-KEY-CernVM /etc/pki/rpm-gpg/RPM-GPG-KEY-CernVM

sudo yum install cvmfs-keys cvmfs cvmfs-init-scripts
sudo usermod -aG fuse cvmfs

sudo cvmfs_config setup

Next, set cms.cern.ch as your client’s default repository:

cat <<EOF | sudo tee /etc/cvmfs/default.local
CVMFS_REPOSITORIES=cms.cern.ch
CVMFS_HTTP_PROXY=http://squidserver.example.com:3128
EOF

sudo service autofs start
sudo service cvmfs restart

sudo cvmfs_config chksetup
sudo service cvmfs probe
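
If the probe succeeds, listing the default repository is a quick sanity check:

ls /cvmfs/cms.cern.ch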

Once that works, add our repository’s config file and public key to your client’s configuration:

wget http://hg.hep.wisc.edu/cmsops/cvmfs/raw-file/tip/cms.hep.wisc.edu.conf
sudo mv cms.hep.wisc.edu.conf /etc/cvmfs/config.d/cms.hep.wisc.edu.conf

wget http://hg.hep.wisc.edu/cmsops/cvmfs/raw-file/tip/cms.hep.wisc.edu.pub
sudo mv cms.hep.wisc.edu.pub /etc/cvmfs/keys/cms.hep.wisc.edu.pub

Finally, make sure that cms.hep.wisc.edu is included in the comma-delimited CVMFS_REPOSITORIES list in /etc/cvmfs/default.local:

grep CVMFS_REPOSITORIES /etc/cvmfs/default.local
export CVMFS_REPOSITORIES=cms.cern.ch,cms.hep.wisc.edu
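
If it is missing, one way to apply the change (mirroring the default.local example above) and pick it up is:

cat <<EOF | sudo tee /etc/cvmfs/default.local
CVMFS_REPOSITORIES=cms.cern.ch,cms.hep.wisc.edu
CVMFS_HTTP_PROXY=http://squidserver.example.com:3128
EOF

sudo service cvmfs restart
sudo service cvmfs probe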

You should now be able to start running simple tests using our CMSSW and other installations.

Maintenance

Normally, the files published via the web server are updated incrementally. This relies on a kernel module that notifies the CVMFS server of all changes to the shadow tree. We have observed cases where errors during publication leave the shadow tree and published repository out of sync. Until the parts of the shadow tree that are out of sync are modified again, the publication process does not attempt to synchronize them. A comparison of the shadow tree and the published repository can be performed with the following command:

cvmfs_server fsck

When the repository is out of sync, this can usually be corrected by renaming the directory containing the out-of-sync files and then renaming it back to its original name. The next publication should then synchronize things.
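
For example, if a directory under the shadow tree is out of sync (the path below is hypothetical), something like the following will make the next publication pick it up:

cd /srv/cvmfs/cms.hep.wisc.edu/shadow/osg/app
mv some-vo some-vo.resync
mv some-vo.resync some-vo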

We have also observed corruption in the file catalogs that was not fixed by the renaming procedure. Some paths could not be accessed, and the cvmfs client crashed when we ran find /cvmfs/cms.hep.wisc.edu | wc -l. This may have been caused by problems during a publication that added a large number of sub-catalogs (~1000) while the file descriptor limit was too low. We have since increased the file descriptor limit and decreased the number of catalogs (to ~100), and the problem has not recurred. To recover from this situation when it happened, we regenerated the CVMFS repository from scratch. This can be accomplished with the following commands:

# move the old published files aside and start with an empty pub area
mv /srv/cvmfs/cms.hep.wisc.edu/pub /srv/cvmfs/cms.hep.wisc.edu/pub.old
mkdir /srv/cvmfs/cms.hep.wisc.edu/pub
chown cvmfs:cvmfs /srv/cvmfs/cms.hep.wisc.edu/pub

# rename the shadow tree away and back so the next publication sees everything as changed
mv /srv/cvmfs/cms.hep.wisc.edu/shadow/osg /srv/cvmfs/cms.hep.wisc.edu/shadow/osg.tmp
mv /srv/cvmfs/cms.hep.wisc.edu/shadow/osg.tmp /srv/cvmfs/cms.hep.wisc.edu/shadow/osg

# run the site publication script to rebuild the repository
/etc/cvmfs/cvmfs-publish /srv/cvmfs/cms.hep.wisc.edu/export-shadow/ /srv/cvmfs/cms.hep.wisc.edu/shadow/ cmsops@hep.wisc.edu

When the publication is finished, it will have restarted numbering the catalog revision at 1. This will strand any jobs whose requirements demand a revision >= the version that was available on the submit machine at the time of submission. We could just condor_qedit all of these jobs, but it is easier to manually set the revision number in cvmfs. To do that, run the following, substituting the appropriate version number, which can be determined by looking in /srv/cvmfs/cms.hep.wisc.edu/archives:

sqlite3 /srv/cvmfs/cms.hep.wisc.edu/pub/catalogs/.cvmfscatalog.working
sqlite> update properties set value=4628 where key="revision";

After editing the revision number in the working catalog, it is necessary to trigger another publication. To do that, touch a file in the shadow tree and rerun the publication process.
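
For example (the touched file name is arbitrary; the publish command is the same one shown above):

touch /srv/cvmfs/cms.hep.wisc.edu/shadow/osg/app/some-file
/etc/cvmfs/cvmfs-publish /srv/cvmfs/cms.hep.wisc.edu/export-shadow/ /srv/cvmfs/cms.hep.wisc.edu/shadow/ cmsops@hep.wisc.edu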

It took about 8 hours to rebuild the repository. There were 3.5M files containing a total of 80GB of data.

What to do if the repository is lost

If the filesystem containing the repository (and read-only copy of $OSG_APP) is lost, everything can be regenerated from the writeable copy of $OSG_APP. The following procedure has been used to do this:

1. Stop cron and httpd.
2. Also stop cron on cvmfs03 (the mirror server), just to be sure the mirror is untouched.
3. Replace the broken disk (/dev/sdc) and create a blank filesystem.
4. Run cfagent to recreate the directory structure.
5. Reboot (this causes the needed filesystems to be remounted before the kernel module is loaded).
6. Make sure httpd is still off.
7. Start cron and wait for publication to finish.
8. Update the catalog revision id. First get the revision of the old mirror:

     login01: sudo cvmfs-talk -i cms.hep.wisc.edu revision

   Then set the revision on the new catalog to be one bigger:

     cvmfs01: sudo sqlite3 /srv/cvmfs/cms.hep.wisc.edu/pub/catalogs/.cvmfscatalog.working
     update properties set value = "3073" where key = "revision";

9. Trigger a new publication to cause the modified revision to take effect.
10. When satisfied that all is well, start httpd and test clients.
11. When all tests pass, start cron on cvmfs03 to resume syncing of the mirror.

What to do if the writeable copy of OSG_APP is lost

If the writeable copy of $OSG_APP is lost, it can be regenerated from the read-only copy in the cvmfs shadow tree. The following procedure has been used to do that:

1. Stop the nfs and crond services on cvmfs01.
2. Stop the crond service on cvmfs03 (just to be sure the backup remains untouched during the repair).
3. Format and mount a new disk for the export shadow (/dev/sdb).
4. rsync from the shadow to the export-shadow.
5. Restore the symlink in the export-shadow:

     cd osg/app/cmssoft/cms/SITECONF
     ln -f -s T2_US_Wisconsin local

6. Start nfs.
7. After the next publication, test that all is well. If it is, start crond on cvmfs03 again to resume syncing of the mirror.

What to do if the system disk on the cvmfs server is lost

Simply replace the disk and re-install via kickstart and cfengine. During this time, it would be prudent to stop cron on cvmfs03 to avoid any possibility of corruption of the mirror until the primary server is trusted again.

What to do if $OSG_APP/etc/grid3-locations.txt is not getting updated

For CMS, this file gets updated by a cron job on a CVMFS-writer machine (currently osggrid01). The cron job is defined in /etc/cron.d/cmssoft_pubtag. Investigate whether the cron job is working.
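
A few things worth checking (the cron log location may differ on your system):

cat /etc/cron.d/cmssoft_pubtag              # is the job still defined as expected?
sudo grep cmssoft_pubtag /var/log/cron      # has cron been running it recently?
ls -l $OSG_APP/etc/grid3-locations.txt      # when was the file last modified?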

Integration with Condor

If CVMFS is not working on a worker node, we do not want jobs that require it to be sent there. To accomplish this, we have Condor run a startd cron script that checks for a working CVMFS installation and publishes the result. It checks the accessibility of the filesystem and it verifies that the partition containing the CVMFS cache has enough free space.
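
A minimal sketch of such a check, assuming a site-local probe script (the script path, the attribute names other than UWCMS_CVMFS_Revision, the cache location, and the period are illustrative rather than our exact configuration):

# condor_config fragment on the worker nodes
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) CVMFS
STARTD_CRON_CVMFS_EXECUTABLE = /usr/local/libexec/cvmfs_probe.sh
STARTD_CRON_CVMFS_PERIOD = 10m

# /usr/local/libexec/cvmfs_probe.sh
#!/bin/sh
if ls /cvmfs/cms.hep.wisc.edu >/dev/null 2>&1; then
    echo "UWCMS_CVMFS_Ok = True"
    # cvmfs-talk may need sufficient privileges to reach the cvmfs socket
    echo "UWCMS_CVMFS_Revision = $(cvmfs-talk -i cms.hep.wisc.edu revision)"
else
    echo "UWCMS_CVMFS_Ok = False"
fi
# free space (MB) on the partition holding the CVMFS cache (default cache location assumed)
echo "UWCMS_CVMFS_CacheFreeMB = $(df -Pm /var/cache/cvmfs2 | awk 'NR==2 {print $4}')"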

Another potential problem is the delay between updates by CVMFS writers and the appearance of those updates on worker nodes. The publication process introduces delays, and there may be up to an hour of additional delay before the CVMFS client reloads the catalog.

To address this concern, we have Condor advertise the file catalog revision being used by the CVMFS client. When jobs are submitted, we inject requirements into the job that require the revision to be greater than or equal to the revision visible on the submit machine at the time when the job was submitted. The injection of requirements on the OSG CEs is handled via a patch to condor.pm that sets the environment variable _CONDOR_UWCMS_CVMFS_REVISION. (Only a very small part of that patch deals with CVMFS.) This is inserted into the job requirements by something equivalent to the following in the Condor configuration:

# UWCMS_CVMFS_REVISION is overridden by condor.pm via _CONDOR_UWCMS_CVMFS_REVISION
UWCMS_CVMFS_REVISION = 0
APPEND_REQ_VANILLA = TARGET.UWCMS_CVMFS_Revision >= $(UWCMS_CVMFS_REVISION)

Matching the file catalog revision between the submit and execute machines doesn’t completely solve the problem of keeping CVMFS up to date. For example, a CVMFS writer could make changes and then submit a job that runs before the changes are even visible on the submit machine. Since updates are assumed to be rare, we do not expect this to be a major problem. One possible solution is for the submitter to verify that the updates are visible on the submit machine before submitting jobs to Condor.

For CMS, this is already taken care of, because the availability of CMSSW versions is published via a script that monitors $OSG_APP/etc/grid3-locations.txt on the submit machine. This currently benefits from the fact that CMS submit nodes are not CVMFS writers. They have the read-only filesystem mounted in /cvmfs, just like the worker nodes. This is made possible by the fact that CMS software management jobs are identifiable and are routed to designated CVMFS writer nodes rather than running directly on the CE.

The general OSG submit node for VOs other than CMS expects software updates to be done on the submit node itself, so things are configured differently there. The writeable version of the file tree is mounted in /cvmfs and the read-only version is mounted in /cvmfs/pub. This means that, without further hacking, grid3-locations.txt is monitored from the writeable copy, which may be more recent than the version in /cvmfs/pub. We haven’t bothered to do further hacking to solve this, because we are under the impression that other VOs do not rely on grid3-locations.txt anyway. One possible solution for VOs that do not use grid3-locations.txt would be to have their CVMFS-writing process block and wait for the updates to appear in /cvmfs/pub.
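
Such a wait could look roughly like the following sketch (the sentinel file is hypothetical; the writer would create or update it as the last step of its installation):

SENTINEL=osg/app/myvo/.install-complete
while ! cmp -s "/cvmfs/$SENTINEL" "/cvmfs/pub/$SENTINEL"; do
    sleep 60    # wait for the next publication to propagate the update
done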

Integration with glideinWMS

We have access to the Open Science Grid via glideinWMS. How can we run jobs that read from CVMFS on OSG sites that do not mount our CVMFS repository? We achieved this by adding support for CVMFS in parrot and running the job under parrot. This allows us to access CVMFS purely in user-space, with no help from the administrator.

This works best when the site provides a Squid cache, so that many machines accessing the same CVMFS files can fetch them from the cache rather than always transferring them over the wide area network. If the site does not have a Squid cache, we rely on reverse proxies in Wisconsin (using Varnish). This at least distributes the load over as many machines as we feel the need to deploy, but it makes inefficient use of the wide area network.

Wrapping the job in parrot can be achieved in a variety of ways. We added a job wrapper in our glideinWMS frontend that runs the job under parrot if the job declares that it requires CVMFS. This frees the users from having to write their own wrapper. All they have to do is declare that their jobs need CVMFS. They do that by putting the following in their Condor submit file:

+RequiresCVMFS = True
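
The wrapper logic amounts to something like the following sketch (illustrative, not our exact frontend wrapper; it assumes the starter exports the job ad location in _CONDOR_JOB_AD, and that parrot_run from cctools is on the path with its CVMFS repository and proxy settings supplied via its environment or options):

#!/bin/sh
# if the job ad requests CVMFS, run the payload under parrot; otherwise run it directly
if [ -n "$_CONDOR_JOB_AD" ] && grep -qi '^RequiresCVMFS *= *true' "$_CONDOR_JOB_AD"; then
    exec parrot_run "$@"
fi
exec "$@"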

Wish List

Based on our experience, the following CVMFS features would be useful:

  • Automatic detection and correction of inconsistencies between the shadow tree and the repository. Basically, we would like a tool that can generate a diff and cause an update of the parts of the repository that are out of sync.

  • Monitoring of cache thrashing: We assign a cache quota of 2GB plus 200MB/core, because this is more than sufficient for the worst case in which each core is running a different CMSSW release. However, we don’t know for sure that this is sufficient for other VOs. To detect thrashing of the local cache, the only option that currently exists is to scan /var/log/messages for frequent instances of the cache hitting the high-water mark.

  • Monitoring of cache usage per job: Currently, there is one local cache per repository. All jobs on the machine share it. There is no way to query how much of the cache is in use by a given job.