This is a simple, extensible package that may be used to run periodic sanity checks on a system. It may optionally be used as a Condor Startd Cron module that will publish test results in the machine ClassAd. This would allow you, for example, to automatically stop running jobs on machines that fail the self-test.
Unpack the "test_machine" tarball anywhere. If you are not on x86 Linux, cd into src and run make. The installation can be read-only, so it does not matter who owns the files as long as they are readable. In addition to the installation location, you need to pick a location where test results are stored. This is referred to as the testing "workdir". It must be writable by the user that runs the tests. Ideally it should have several times as much free space as the system has RAM; at a minimum, it needs about half as much space as RAM.
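The sizing guidance for the workdir can be checked before the first run. The following is a minimal Python sketch and is not part of the package. The function names are hypothetical, and the factor of 3 is an assumed stand-in for "several times".

```python
import os
import shutil


def classify_workdir_space(free_bytes, ram_bytes, ideal_factor=3.0):
    """Compare free space for the workdir against installed RAM.

    ideal_factor is an assumption: the text only says "several times".
    """
    if free_bytes >= ideal_factor * ram_bytes:
        return "ideal"
    if free_bytes >= 0.5 * ram_bytes:
        return "minimum"
    return "too small"


def check_workdir(path):
    """Measure the local system (Unix only, since it relies on os.sysconf).

    path must already exist; pass the parent directory if the workdir
    has not been created yet.
    """
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_bytes = shutil.disk_usage(path).free
    return classify_workdir_space(free_bytes, ram_bytes)
```

Calling check_workdir("/path/to") on the file system that will hold the workdir returns one of the three labels above.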
Once you have finished the installation, you can run tests by hand like this:
/path/to/test_machine/bin/test_machine --fast /path/to/test_workdir
If the workdir does not already exist, test_machine will attempt to create it.
To run the tests as a Condor Startd Cron module, you need to add the equivalent of the following commands to your Condor configuration file:
STARTD_CRON_JOBS=test_machine::/path/to/test_machine/bin/test_machine_hawkeye:24h
STARTD_CRON_test_machine_ARGS = /path/to/workdir
Make sure that the workdir is writable by the condor user.
Instead of running the tests under Condor directly, you may call test_machine from elsewhere (for example, from a cron job) and have test_machine_hawkeye only monitor the results. Even if Condor does run the tests, you may want the ClassAd updated more frequently so that additional test runs are picked up promptly, for example after a repair has been made to a machine. To do this, add the following to your Condor configuration file:
STARTD_CRON_JOBS=test_machine_monitor::/path/to/test_machine/bin/test_machine_hawkeye:10m
STARTD_CRON_test_machine_monitor_ARGS = --read_only /path/to/workdir
The test_machine command will exit with zero status if all tests pass. Results are stored in the test workdir in a file named "test_machine.state".
The first time test_machine is run, it uses the longer, more thorough tests. After a successful run, it switches to the faster, less thorough tests. If a fast test ever fails, the state is reset so that the long tests are run until the system passes again. Long tests are also run periodically, with the period configured in etc/test_machine.config.
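The choice between long and fast tests can be summarized as a small decision function. This is an illustrative Python sketch of the rules described above, not the package's actual code. The function name, its arguments and the use of seconds are assumptions.

```python
def choose_mode(last_result, seconds_since_long, long_period):
    """Pick "long" or "fast" for the next run.

    last_result: None if test_machine has never run, otherwise True
                 (last run passed) or False (last run failed).
    seconds_since_long: time since the last long run, in seconds.
    long_period: configured time between long runs, in seconds.
    """
    if last_result is None:
        return "long"  # first run: use the thorough tests
    if not last_result:
        return "long"  # after a failure: long tests until the system passes
    if seconds_since_long >= long_period:
        return "long"  # periodic long run is due
    return "fast"
```

For example, with a one-day period, a machine that passed ten seconds after its last long run would get "fast", and one whose last run failed would get "long".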
When used as a Hawkeye module, the following ClassAd attributes are reported:
PassedTest = True/False
TestsPassed = {list of tests that passed}
TestsFailed = {list of tests that failed}
LastTest = {timestamp}
LastFailedTest = {timestamp}
LastPassedTest = {timestamp}
TestFailureCount = {count of failed tests over all history}
TestPassCount = {count of successful tests over all history}
TestHistoryBegins = {timestamp}
The failure/pass counts are counts of results from test_machine as a whole, not counts of individual sub-tests that pass or fail.
To disable the running of jobs on a machine that has failed the test, you could insert the following into the START expression:
START = (...) && MY.PassedTest =!= False
If tests start succeeding at a later time, jobs will automatically be allowed again. If you would rather require manual intervention (like wiping the test_machine.state file) to resume jobs, you could use the following expression instead:
START = (...) && MY.TestFailureCount =?= 0
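The two expressions differ in how they treat an attribute that is missing from the machine ClassAd. The Python sketch below is a simplified model of the ClassAd operators =?= ("is identical to") and =!= ("is not identical to"), covering only booleans, integers and undefined values. None stands in for an undefined attribute, and the function names are hypothetical.

```python
UNDEFINED = None  # stands in for an attribute that is absent from the ClassAd


def is_identical(a, b):
    """Model of =?= : true only when both sides have the same type and value."""
    return type(a) is type(b) and a == b


def is_not_identical(a, b):
    """Model of =!= : the negation of =?=."""
    return not is_identical(a, b)


def start_auto_resume(ad):
    """Model of: MY.PassedTest =!= False"""
    return is_not_identical(ad.get("PassedTest", UNDEFINED), False)


def start_manual_reset(ad):
    """Model of: MY.TestFailureCount =?= 0"""
    return is_identical(ad.get("TestFailureCount", UNDEFINED), 0)
```

In this model, the first expression lets jobs start when no test result has been published yet, while the second does not, because an undefined TestFailureCount is not identical to 0.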
A few simple configuration parameters may be set in etc/test_machine.config. For example, you may set the time between long tests and the time of day at which tests should be run. See the config file for more information on the available options.
Adding more test modules is a simple matter. Each module is a subdirectory within the "tests" directory. The module subdirectory should contain an executable with the same name, for example tests/shm_test/shm_test.
For each test module, a working subdirectory is created within the workdir. The test module is then invoked like so:
shm_test [options] /path/to/module/workdir >& /path/to/workdir/shm_test.log

OPTIONS:
  --fast  (indicates that we are running in fast test mode)
  --long  (indicates that we are running in long test mode)
If the module should only be run in long test mode, add it to the list of long modules in etc/test_machine.config.
If the module exits with status 0, the test is considered to have passed; any other exit status counts as a failure.
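As an illustration of this interface, here is a minimal sketch of a test module written in Python. The module name disk_write_test, the file sizes and the default mode when neither option is given are all assumptions made for the example. Only the --fast and --long options, the workdir argument and the exit-status convention come from the description above.

```python
#!/usr/bin/env python3
import os
import sys


def run_test(mode, workdir):
    """Write a known pattern to a scratch file in workdir and read it back."""
    # Assumed sizes: 1 MiB in fast mode, 16 MiB in long mode.
    size = 1024 * 1024 if mode == "fast" else 16 * 1024 * 1024
    pattern = bytes(range(256)) * (size // 256)
    scratch = os.path.join(workdir, "scratch.dat")
    with open(scratch, "wb") as f:
        f.write(pattern)
    with open(scratch, "rb") as f:
        ok = f.read() == pattern
    os.remove(scratch)
    return ok


def main(argv):
    """Parse [--fast|--long] WORKDIR and return the exit status."""
    mode = "long"  # assumption: default when neither option is given
    positional = []
    for arg in argv:
        if arg == "--fast":
            mode = "fast"
        elif arg == "--long":
            mode = "long"
        else:
            positional.append(arg)
    if len(positional) != 1:
        print("usage: disk_write_test [--fast|--long] /path/to/module/workdir",
              file=sys.stderr)
        return 2
    ok = run_test(mode, positional[0])
    print("PASS" if ok else "FAIL")  # output ends up in the module's log file
    return 0 if ok else 1


# When saved as the module executable, finish the file with:
#     sys.exit(main(sys.argv[1:]))
```

Saved as tests/disk_write_test/disk_write_test and made executable, a file like this would follow the naming rule for modules described above.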