Experience on Grid3 showed that jobs submitted through the Globus gatekeeper sometimes get orphaned: Globus gives up on the job but the batch system does not. This is especially noticeable when the batch system handling the job is Condor, because Condor does not give up on a job until it has run successfully. If Globus removes the job's stdin/stdout/executable files from the GASS Cache, Condor will fail to start the job, assume a transient filesystem failure, and keep retrying periodically until somebody manually removes the orphaned job from the queue.
Known causes of orphaned jobs (as of VDT 1.1.11):
To solve this problem, a patch was created for the Condor jobmanager. It uses Condor's periodic_hold expression to place an upper bound on how long Condor should keep the job in its queue. As long as the jobmanager is alive, it polls the status of the job and extends the upper bound as needed. If the jobmanager goes away and the remaining time expires, the job is placed on hold.
This prevents Condor from making further wasted attempts to run the job (attempts which can otherwise interrupt or stand in the way of other jobs). If the jobmanager somehow comes back and regains control over the job, the hold is released automatically and the job continues without any special attention.
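The renew-or-hold behaviour described above can be pictured with a small, self-contained model. This is only a sketch of the idea, not the patch itself: the names QueuedJob, lease_expires, renew_lease, evaluate_policy and the 600-second LEASE_SECONDS value are all hypothetical, chosen for illustration, and the real patch expresses the policy through Condor's periodic_hold expression rather than Python.

```python
from dataclasses import dataclass

# Hypothetical lease length in seconds; the value used by the real patch is not stated here.
LEASE_SECONDS = 600


@dataclass
class QueuedJob:
    """Toy stand-in for a job in the Condor queue (not a real ClassAd)."""
    lease_expires: float   # hypothetical attribute: upper bound on time in the queue
    on_hold: bool = False


def renew_lease(job: QueuedJob, now: float) -> None:
    """What a live jobmanager does on each status poll: push the bound forward."""
    job.lease_expires = now + LEASE_SECONDS


def evaluate_policy(job: QueuedJob, now: float) -> None:
    """Stand-in for the queue's periodic policy check."""
    # Expired lease: hold the job. Valid lease: make sure it is not held.
    job.on_hold = now > job.lease_expires


def simulate() -> list:
    """Walk one job through: jobmanager alive, jobmanager gone, jobmanager back."""
    job = QueuedJob(lease_expires=0.0)
    history = []

    renew_lease(job, now=0)            # jobmanager alive, polls at t=0
    evaluate_policy(job, now=300)
    history.append(job.on_hold)        # lease still valid

    evaluate_policy(job, now=700)      # jobmanager gone, no renewal since t=0
    history.append(job.on_hold)        # lease expired, job is held

    renew_lease(job, now=800)          # jobmanager regains control
    evaluate_policy(job, now=810)
    history.append(job.on_hold)        # hold released automatically

    return history


if __name__ == "__main__":
    print(simulate())
```

The design choice worth noting is that the job is held rather than removed: a hold is reversible, so a jobmanager that reappears only has to renew the bound for the job to proceed.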
The administrator is still responsible for periodically removing orphans that have gone into the hold state. This is left as a manual step partly because it advertises the fact that orphans are being created.
Note that the current version of Condor (V6.6.X) does not update the copy of the job ClassAd held by the shadow process when the jobmanager changes the job ad in the Condor schedd queue. This means that jobs that become orphaned while they are running will not be halted until they stop running or are evicted.