How to turn off Condor

How to shut down Condor on a single machine without killing jobs

This example initiates a peaceful shutdown of condor on g12n01.hep.wisc.edu. Running jobs are allowed to finish, but no new ones are allowed to start. Once all jobs are gone, the condor_startd is stopped and it will no longer appear in condor_status. The condor_master continues to run, so puppet will not start condor again, because it will consider condor to be running. It is possible to start things back up remotely using condor_on.

This command should be run from a machine with administrative rights. Currently, that means condor.hep.wisc.edu or condor02.hep.wisc.edu. Be _very_ careful when running these commands. A small typo could cause the whole condor pool to shut down!

condor_off -peaceful -startd g12n01.hep.wisc.edu

To see whether condor has finished shutting down, check whether condor_startd is still running on the machine, or run the following command:

condor_status g12n01.hep.wisc.edu

Once the condor_startd has shut down, the output from the above command will be empty.
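The check above can be repeated automatically until the startd is gone. The following is only a sketch: wait_for_startd_gone is a hypothetical helper name, and the 60-second default polling interval is an arbitrary choice. It treats empty condor_status output as "shut down", so it would also stop waiting if condor_status itself failed to reach the collector.

```shell
# Hypothetical helper: poll condor_status until the startd on the given
# host no longer appears. Second argument is the polling interval in
# seconds (defaults to 60, an arbitrary choice).
wait_for_startd_gone() {
    host=$1
    interval=${2:-60}
    # condor_status prints nothing once the startd ad is gone.
    # Caveat: a failed query also prints nothing on stdout.
    while [ -n "$(condor_status "$host")" ]; do
        sleep "$interval"
    done
    echo "startd on $host has shut down"
}

# Typical use (not run here):
#   wait_for_startd_gone g12n01.hep.wisc.edu
```

Since a peaceful shutdown waits for running jobs to finish, the loop may run for a long time.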

If the machine reboots, condor will start back up normally, allowing jobs to run. Nothing persistently records that the startd was turned off.

How to shut down Condor on a rack without killing jobs

This example initiates a peaceful shutdown of condor on the g12 rack. Running jobs are allowed to finish, but no new ones are allowed to start. Once all jobs are gone, the condor_startd is stopped and it will no longer appear in condor_status. The condor_master continues to run, so puppet will not start condor again, because it will consider condor to be running. It is possible to start things back up remotely using condor_on.

This command should be run from a machine with administrative rights. Currently, that means condor.hep.wisc.edu or condor02.hep.wisc.edu. Be _very_ careful when running these commands. A small typo could cause the whole condor pool to shut down!

condor_off -peaceful -startd -constraint 'regexp("^g12n",machine)'
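Because a typo in the constraint can match far more machines than intended, one option is to build the constraint with a small helper and list the matching machines before shutting anything down. This is a sketch, not an established procedure: rack_constraint is a hypothetical helper, and it assumes rack names consist only of lowercase letters and digits (as in g12) and that node names follow the g12nNN pattern used above.

```shell
# Hypothetical helper: build the ClassAd constraint matching every node
# in one rack. Assumes rack names are lowercase letters and digits only.
rack_constraint() {
    case $1 in
        ''|*[!a-z0-9]*)
            echo "refusing suspicious rack name: '$1'" >&2
            return 1
            ;;
    esac
    printf 'regexp("^%sn",machine)\n' "$1"
}

# Typical use (not run here): preview the matching machines, then shut down.
#   c=$(rack_constraint g12) || exit 1
#   condor_status -constraint "$c" -af machine | sort | uniq
#   condor_off -peaceful -startd -constraint "$c"
```

If the preview lists anything outside the intended rack, fix the constraint before running condor_off.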

If the machine reboots, condor will start back up normally, allowing jobs to run. Nothing persistently records that the startd was turned off.

How to shut down Condor on a rack quickly, killing jobs

The fastest way is to just turn off the machine.

A more graceful way is to initiate a fast shutdown. This example initiates a fast shutdown of condor on the g12 rack. Running jobs are killed rather than allowed to finish. Once all jobs are gone, the condor_startd is stopped and it will no longer appear in condor_status. The condor_master continues to run, so puppet will not start condor again, because it will consider condor to be running. It is possible to start things back up remotely using condor_on.

This command should be run from a machine with administrative rights. Currently, that means condor.hep.wisc.edu or condor02.hep.wisc.edu. Be _very_ careful when running these commands. A small typo could cause the whole condor pool to shut down!

condor_off -fast -constraint 'regexp("^g12n",machine)'

If the machine reboots, condor will start back up normally, allowing jobs to run. Nothing persistently records that the startd was turned off. One way to prevent jobs from running in that situation is to make the Start expression evaluate to false on the machines that should not run jobs. For example, the following could be appended to the Start expression (defined in puppet in 00hep_wisc.config.erb):

 && regexp("^g12n",MY.Machine) =!= True

How to turn Condor back on

Unless the -master option is specified to condor_off, the master process continues to run and can therefore be remotely administered. To turn on the condor daemons that were previously shut down, one can use the condor_on command.

This command should be run from a machine with administrative rights. Currently, that means condor.hep.wisc.edu or condor02.hep.wisc.edu.

condor_on -constraint 'regexp("^g12n",machine)'

Note that this command has no effect if a shutdown is still in progress. There is currently no way to cancel an ongoing shutdown.

To see which machines are not running a startd, the following commands can be used:

condor_status -master -f "%s\n" machine | sort | uniq > /tmp/masters
condor_status -startd -f "%s\n" machine | sort | uniq > /tmp/startds
join -v 1 /tmp/masters /tmp/startds

If the Start expression was modified to prevent jobs from starting, reset the expression to normal.

How to quickly restart condor on machines where stuck jobs are preventing it from restarting automatically (e.g. during an upgrade)

Run the following commands on the central manager (condor.hep.wisc.edu):

condor_status -const ' TotalCondorLoadAvg < 0.1 && regexp("g[0-9]",machine)' -af machine | sort | uniq > /tmp/idle_machines
# -r (GNU xargs) skips condor_off entirely if the list is empty
xargs -r -n1 < /tmp/idle_machines condor_off -fast

After the startds shut down, the above condor_status command will no longer see them. That is why we save the list of machines to a file, so condor can still be restarted. The -r option (GNU xargs) skips the command when the list is empty. Example:

xargs -r -n1 < /tmp/idle_machines condor_restart
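Since these xargs commands act on whatever the saved list contains, it may be worth checking the file before using it. The sketch below is one possible guard: check_machine_list is a hypothetical helper, and the hostname pattern it accepts (letters, digits, dots and hyphens, not starting with a hyphen) is an assumption about how machine names look in this pool.

```shell
# Hypothetical helper: succeed only if the file is non-empty and every
# line looks like a single hostname (so no line can be read as an option).
check_machine_list() {
    [ -s "$1" ] || { echo "machine list '$1' is empty or missing" >&2; return 1; }
    # grep -v selects lines that do NOT match the hostname pattern.
    if grep -qvE '^[A-Za-z0-9][A-Za-z0-9.-]*$' "$1"; then
        echo "machine list '$1' contains unexpected lines" >&2
        return 1
    fi
}

# Typical use (not run here):
#   check_machine_list /tmp/idle_machines &&
#       xargs -n1 < /tmp/idle_machines condor_restart
```

The check rejects an empty file because, with GNU xargs and no -r option, empty input still runs the command once with no machine name.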