CondorG Troubleshooting

Common hold states observed in jobs submitted to OSG

  • Globus error 7: an authorization operation failed

    • This can be caused by a variety of GSI authentication issues: CA/CRL expiration, reverse DNS issues, time skew, proxy issues.
  • Globus error 9: the system cancelled the job

  • Globus error 10: data transfer to the server failed

  • Globus error 17: the job failed when the job manager attempted to run it

  • Globus error 22: the job manager failed to create an internal script argument file

    • (May be caused by a bad condor.pm or a full disk on the CE.)
  • Globus error 31: the job manager failed to cancel the job as requested

    • (This error occurs whenever condor_rm is issued to remove jobs that are in the Held (H) state. So far it has been seen with jobs that were removed while affected by Globus error 10.)
  • Globus error 93: GRAM Job submission failed because the gatekeeper failed to find the requested service

    • (The appropriate jobmanager service is not available or not configured in the $GLOBUS_LOCATION/etc/grid-services directory.)
  • Globus error 94: the jobmanager does not accept any new requests (shutting down)

    • This can be caused by the job in the remote queue being removed (e.g. by admin or PeriodicRemove).
  • Globus error 121: the job state file doesn't exist

    • This can be caused by loss of the contents of $OSG_LOCATION/globus/tmp/gram_job_state (e.g. during an OSG upgrade on the CE).
  • Globus error 122: could not read the job state file

    • The jobs may have been removed at the site. In that case the copies on the submit side must be removed as well. Use condor_rm on the jobs, for example:
      • condor_rm -f -constraint 'GridResource == "gt2 ce01.cmsaf.mit.edu/jobmanager-condor" && NumSystemHolds > 3'
  • Globus error 155: the job manager could not stage out a file

    • This can happen when the job fails and one of the expected output files is not produced, or does not exist in the expected location on the CE, at the time the output is copied back to the submit node. It is recommended to list all output files as input files as well and to create zero-sized copies of these files before submission. This should avoid the common cases that cause Globus error 155 and makes the real problem easier to debug, because other files such as stdout/stderr can still be staged back.
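
The zero-sized-file workaround for Globus error 155 can be sketched as a small pre-submission step. This is a minimal sketch, not a definitive procedure: the file names result.dat and summary.log and the helper name make_placeholders are hypothetical, and the submit-file lines in the comments assume the job uses the standard transfer_input_files and transfer_output_files submit commands.

```shell
#!/bin/sh
# Hypothetical names of the files the job is expected to produce;
# replace them with the real output files of your job.
OUTPUT_FILES="result.dat summary.log"

# Create a zero-sized placeholder for every expected output file.
# ':' is a no-op and '>>' opens the file in append mode, so a missing
# file is created empty while an existing file keeps its content.
make_placeholders() {
    for f in "$@"; do
        : >> "$f"
    done
}

# Intentionally unquoted so the list splits into one argument per file.
make_placeholders $OUTPUT_FILES

# Assumed matching lines in the submit description file, so that the
# placeholders travel to the CE and stage-out always finds something:
#   transfer_input_files  = result.dat, summary.log
#   transfer_output_files = result.dat, summary.log
```

Run the script in the submit directory before condor_submit. If the job later fails, a zero-sized output file coming back is itself a hint that the job never wrote it.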