HDFS#

Services#

Namenodes#

Datanodes (Storage nodes)#

  • Storage-only node groups: s15, s17, s21
  • Storage nodes that are also compute nodes: g12, g14, g18, g19, g20, g2[2-9], g3[0-1]
  • Log: /var/log/hadoop-hdfs/hadoop-hdfs-datanode-$FQDN.log
  • Data: /data/??/hdfs
  • Configuration: /etc/hadoop/conf (puppet:/etc/puppet/modules/hadoop/manifests/datanodes.pp)
  • Daemon: hadoop-hdfs-datanode
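  • Useful commands to check a datanode's daemon and log on a given node (a sketch; assumes the standard service(8) interface for the daemon listed above and substitutes the node's FQDN into the log name):#

     sudo service hadoop-hdfs-datanode status
     tail -n 50 /var/log/hadoop-hdfs/hadoop-hdfs-datanode-$(hostname -f).log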

SRM Servers#

Servers:#

 cmssrm.hep.wisc.edu (alias cmssrm1.hep.wisc.edu)
 cmssrm2.hep.wisc.edu (for local analysis jobs using the farmout script)
  • Logs: /var/log/bestman2/{event.srm.log,bestman2.log}
  • Configuration: /etc/bestman2 (puppet:/etc/puppet/modules/osgrpms/manifests/srm_servers.pp)
  • Daemon: bestman2
  • GridFTP servers list: /etc/bestman2/conf/protocols (generated by cron)
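  • Useful commands to check that an SRM server is alive (a sketch; the daemon name comes from above, while the gfal-ls client and the srm:// endpoint, port, and SFN path are assumptions to verify against the local BeStMan configuration):#

     sudo service bestman2 status
     # assumed BeStMan endpoint format; verify port and SFN path in /etc/bestman2
     gfal-ls 'srm://cmssrm.hep.wisc.edu:8443/srm/v2/server?SFN=/hdfs/store/user'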

GridFTP#

  • All datanodes except s21 nodes
  • Log: /var/log/messages
  • Configuration: /etc/gridftp-hdfs (puppet:/etc/puppet/modules/hadoop/manifests/gftp.pp)
  • Daemon: globus-gridftp-server
  • Useful command to test that a gridftp server (e.g. g22n02.hep.wisc.edu) works:#

     globus-url-copy -v gsiftp://g22n02.hep.wisc.edu:2811//hdfs/store/user/tapas/tt .
    

xrootd manager and servers#

  • Redirectors: cmsxrootd.hep.wisc.edu (local), pubxrootd.hep.wisc.edu (global)
  • Supervisors: All the s15 nodes
  • Servers: All datanodes are Xrootd servers (check for xrootd_datanode in puppet:/etc/puppet/modules/cfeng/manifests/groups.pp)
  • Log: /var/log/xrootd/{xrootd.log,cmsd.log}
  • Configuration: /etc/xrootd (puppet:/etc/puppet/modules/hadoop/manifests/xrootd.pp)
  • Daemon: xrootd, cmsd
  • Useful commands to test that the xrootd redirector and/or a server is working:#

    xrdcp -v -d 2 root://cmsxrootd.hep.wisc.edu//store/user/dan/test_file .
    xrdcp -v -d 2 root://g22n02.hep.wisc.edu:31094//store/user/dan/test_file .
    

Operations#

Decommissioning a datanode#

Add the hostname.domain to this file:#

   puppet:/etc/puppet/modules/hadoop/files/hadoop_root/conf/hosts_exclude

Log in to both namenodes and run on each:#

 sudo puppet agent -t

The node being decommissioned will appear in the NameNode's list of decommissioning nodes. Once the draining is done, it will appear in the list of dead nodes.#
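
To watch the progress, a sketch using a standard HDFS admin command (run on a namenode; the sudo -u hdfs invocation is an assumption about how superuser access is arranged here):#

 sudo -u hdfs hdfs dfsadmin -report | grep -E 'Name:|Decommission Status'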

Checking a file#

A file in HDFS becomes unavailable when the datanodes hosting one or more of its blocks go offline, and becomes corrupt when one or more of its blocks are damaged. To determine whether a file is unavailable or corrupt, first run fsck on the NameNode:#

 hdfs fsck /<path>

Where /<path> corresponds to the path visible under the FUSE-mounted /hdfs/.#

If hdfs fsck indicates that blocks are missing, find the datanodes that stored the blocks and bring them back online. #
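
To see the block list and the current replica locations for a specific file, the standard fsck options can be used (this will not show where a completely missing block used to live; for that, use the find-missing-hdfs-blocks script described below):#

 hdfs fsck /<path> -files -blocks -locations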

To find the previous locations of a block, ssh to cron02 and run#

 /usr/local/bin/hadoop/find-missing-hdfs-blocks --use-web

This checks the bad block locations against slightly stale location information, but it is faster than the next command, which dumps the current block locations:#

 /usr/local/bin/hadoop/find-missing-hdfs-blocks --run-fsck

For either of these commands one can watch the file /scratch/find-missing-hdfs-blocks for progress. #

If hdfs fsck shows that all of the file’s blocks are present, verify that the file is readable:#

 hdfs dfs -get /<path> > /dev/null

If any of the blocks are corrupt, hdfs dfs -get will print a message like the following:#

 get: Could not obtain block: blk_-7159385964596408947_2125807 file=/store/data/Commissioning10/ZeroBias/RECO/Sep17ReReco_v2/0048/627F4662-87C6-DF11-8553-001A928116E8.root

Since HDFS automatically replaces corrupt blocks when other valid replicas are present, a file that remains corrupt has no valid replica left; either locate a good copy of the block or replace the entire file.#
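
To look for a surviving copy of a specific block on a datanode (for example, one brought back online temporarily), search the data directories listed above; a sketch using the block ID from the example error message:#

 find /data/??/hdfs -name 'blk_-7159385964596408947*' 2>/dev/null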

To proactively scan for invalid blocks, search the logs for messages like Block (blk_.*) is not valid.#
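
A sketch of such a scan over the datanode logs (assuming the messages appear under the datanode log path given above; adjust the path if they are logged elsewhere):#

 grep -E 'Block blk_.* is not valid' /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log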

Updating block placement by rack#

When changing HDFS's rack-awareness configuration, it is necessary to force the NameNode to re-replicate any blocks that may violate the new policy. After the new policy is in place, select parts of the namespace and temporarily increase the replication factor; once all of the blocks have an extra replica, decrease it again. The NameNode will almost always choose a datanode that complies with the new policy when creating the extra replica. Similarly, it will almost always remove a replica from a datanode that does not comply with the policy when reducing the number of replicas.#

 hadoop fs -setrep -R 3 /syslogs
 hadoop fsck /syslogs | tail -20 | grep replicated
 hadoop fs -setrep -R 2 /syslogs
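
Before forcing the re-replication, it can be worth confirming that the NameNode has picked up the new rack mapping; a sketch using a standard admin command (run with HDFS superuser privileges):#

 sudo -u hdfs hdfs dfsadmin -printTopology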

Cleaning old files from /hdfs/store/unmerged#

Dynamo now manages cleaning of unmerged files. The text below describes our own unmerged file deleter, which has been disabled. In the future, if Dynamo is no longer used, we may need to re-enable our deleter.#

The /store/unmerged directory holds files created by production jobs while they are running. After all jobs from a task have completed, their files are no longer needed. There is a cron job (/usr/local/bin/delUnmerged/delOldUnmerged.sh) running on cron02 every night that deletes old files from this directory. To be deleted, files have to be at least two weeks old and not in use by any running jobs; candidates are checked against the active-file list produced by Unified.#
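
For a rough idea of what the deleter would consider, one can list files under the FUSE mount that are older than the two-week threshold; this sketch applies only the age criterion and does not consult Unified's active-file list, so it must not be used to delete anything directly:#

 find /hdfs/store/unmerged -type f -mtime +14 2>/dev/null | head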

The list of files to be deleted is stored in /var/log/store-unmerged-deleter. There is one list for each daily run going back two weeks. Older lists are removed.#

The output from each run of the deleter is logged in the file /var/log/clean-hdfs. This log file is managed by logrotate.#

The deleter script has been customized for our use. It works around the Hadoop FUSE bug where listing empty directories causes a “resource unavailable” error. It is derived from the script documented here.#
