Machine Test Utility

This is a simple, extensible package that may be used to run periodic sanity checks on a system. It may optionally be used as a Condor Startd Cron module that will publish test results in the machine ClassAd. This would allow you, for example, to automatically stop running jobs on machines that fail the self-test.

Installation

Unpack the "test_machine" tarball anywhere you like. If you are not on x86 Linux, cd into src and run make. The installation can be read-only, so it doesn't matter who owns the files as long as they are readable. In addition to the installation location, you will need to pick a location where test results are stored. This is referred to as the testing "workdir". It must be writable by the user that runs the tests. Ideally it should have several times as much free space as the system has RAM; at a minimum, it should have about half as much space as RAM.

Once you have finished the installation, you can run tests by hand like this:

/path/to/test_machine/bin/test_machine --fast /path/to/test_workdir

If the workdir does not already exist, test_machine will attempt to create it.

To run the tests as a Condor Startd Cron module, you need to add the equivalent of the following commands to your Condor configuration file:

STARTD_CRON_JOBS=test_machine::/path/to/test_machine/bin/test_machine_hawkeye:24h
STARTD_CRON_test_machine_ARGS = /path/to/workdir

Make sure that the workdir is writable by the condor user.

Instead of having the tests run under Condor directly, you may call test_machine from elsewhere (such as a cron job) and just have test_machine_hawkeye monitor the results. Even if you do want Condor to run the tests, you may want more frequent ClassAd updates in case additional tests are run by hand, for example after making a repair to a machine. To do so, add the following to your Condor configuration file:

STARTD_CRON_JOBS=test_machine_monitor::/path/to/test_machine/bin/test_machine_hawkeye:10m
STARTD_CRON_test_machine_monitor_ARGS = --read_only /path/to/workdir

Behavior

The test_machine command will exit with zero status if all tests pass. Results are stored in the test workdir in a file named "test_machine.state".
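Because overall success is reported through the exit status, a wrapper script (for example one called from cron) can branch on it. The following is a minimal sketch only: the helper name run_and_report and its messages are illustrative and not part of the package, and the paths in the commented invocation are the same placeholders used above.

```shell
#!/bin/sh
# run_and_report is a hypothetical helper, not part of test_machine.
# It runs the command given as its arguments and prints a one-line
# summary based on the exit status (0 means every test passed).
run_and_report() {
    if "$@"; then
        status=0
    else
        status=$?
    fi

    if [ "$status" -eq 0 ]; then
        echo "test_machine: all tests passed"
    else
        echo "test_machine: FAILED (exit status $status)"
    fi
    return "$status"
}

# Example invocation with placeholder paths:
# run_and_report /path/to/test_machine/bin/test_machine --fast /path/to/test_workdir
```

The helper returns the same status it observed, so a caller can still act on it.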

The first time test_machine is run, longer, more thorough tests are used. After a successful run, faster, less thorough tests are used. If one of the fast tests ever fails, the state is reset back to running the longer tests until the system passes. Long tests are also run periodically, with the period configured in etc/test_machine.config.
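The mode selection described above can be summarized as a small decision function. This is a sketch of the described behavior, not the actual test_machine implementation: the function name next_mode, its arguments, and the one-week period in the example are all assumptions made for illustration.

```shell
#!/bin/sh
# next_mode LAST_RESULT SECONDS_SINCE_LONG LONG_PERIOD
# (hypothetical helper illustrating the rules in the text)
#   LAST_RESULT         "none" (never run), "pass", or "fail"
#   SECONDS_SINCE_LONG  seconds since the last long test run
#   LONG_PERIOD         configured seconds between long test runs
# Prints "long" or "fast".
next_mode() {
    last_result=$1
    since_long=$2
    long_period=$3

    if [ "$last_result" != "pass" ]; then
        # First run, or a previous failure: use the thorough tests.
        echo long
    elif [ "$since_long" -ge "$long_period" ]; then
        # Passing, but the periodic long test is due.
        echo long
    else
        echo fast
    fi
}

# Example with an assumed one-week (604800 second) long-test period:
# next_mode pass 3600 604800
```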

When used as a Hawkeye module, the following ClassAd attributes are reported:

PassedTest = True/False
TestsPassed = {list of tests that passed}
TestsFailed = {list of tests that failed}
LastTest = {timestamp}
LastFailedTest = {timestamp}
LastPassedTest = {timestamp}
TestFailureCount = {count of failed tests over all history}
TestPassCount = {count of successful tests over all history}
TestHistoryBegins = {timestamp}

The failure/pass counts are counts of results from test_machine as a whole, not counts of individual sub-tests that pass or fail.

To disable the running of jobs on a machine that has failed the test, you could insert the following into the START expression:

START = (...) && MY.PassedTest =!= False

If tests start succeeding at a later time, jobs will automatically be allowed again. If you would rather require manual intervention (like wiping the test_machine.state file) to resume jobs, you could use the following expression instead:

START = (...) && MY.TestFailureCount =?= 0

Configuration

A few simple configuration parameters may be set in etc/test_machine.config. For example, you may set the time between long tests and the time of day at which tests should be run. View the config file for more information on options that exist.

Test Modules

cp_test
This is a quick test to see if a file can be copied without errors. It is less likely to catch media errors than it is to catch memory or PCI bus corruption.
diff_test
This is only run in long-test mode. It tests to see if identical files can be compared successfully. The main intention is to catch memory problems that are only visible at the high speeds achieved during DMA data transfers.
memtester
This is a memory testing package created by Charles Cazabon. See src/memtester for more information. This test performs better when run as root, since it can lock memory in place, but running as root is not an absolute requirement.
shm_test
This is yet another test aimed at catching memory corruption by reading and writing data to /dev/shm. You can control the specific path used for testing in test_machine.config.

Adding Additional Test Modules

Adding more test modules is a simple matter. Each module is a subdirectory within the "tests" directory. The module subdirectory should contain an executable with the same name, for example tests/shm_test/shm_test.

For each test module, a working subdirectory is created within the workdir. The test module is then invoked like so:

shm_test [options] /path/to/module/workdir >& /path/to/workdir/shm_test.log

OPTIONS:
  --fast      (indicates that we are running in fast test mode)
  --long      (indicates that we are running in long test mode)

If the module should only be run in long test mode, add it to the list of long modules in etc/test_machine.config.

If the module exits with exit status 0, the test is considered to have passed; any other exit status means it failed.
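To make the calling convention concrete, here is a minimal sketch of a module written as a shell script. Everything specific in it is an assumption for illustration: the module name example_test, the file names, the data sizes, and the choice of a copy-and-compare check. Only the --fast/--long options, the workdir argument, and the meaning of the exit status come from the description above.

```shell
#!/bin/sh
# Hypothetical module; it would be installed as tests/example_test/example_test.
# Copies a file of random data inside the module workdir and compares
# the copy with the original. Returns 0 on success, non-zero on failure.
example_test() {
    mode=fast
    workdir=
    for arg in "$@"; do
        case "$arg" in
            --fast) mode=fast ;;
            --long) mode=long ;;
            -*) echo "unknown option: $arg" >&2; return 2 ;;
            *) workdir=$arg ;;
        esac
    done

    if [ -z "$workdir" ] || [ ! -d "$workdir" ]; then
        echo "usage: example_test [--fast|--long] /path/to/module/workdir" >&2
        return 2
    fi

    # Assumed sizes: 64 KB in long mode, 4 KB in fast mode.
    if [ "$mode" = long ]; then blocks=64; else blocks=4; fi

    src=$workdir/pattern.src
    dst=$workdir/pattern.dst
    dd if=/dev/urandom of="$src" bs=1024 count="$blocks" 2>/dev/null || return 1
    cp "$src" "$dst" || return 1

    if cmp -s "$src" "$dst"; then
        echo "example_test ($mode): copy matches original"
        status=0
    else
        echo "example_test ($mode): copy differs from original"
        status=1
    fi
    rm -f "$src" "$dst"
    return "$status"
}

# Run the test only when the script is given arguments.
if [ "$#" -gt 0 ]; then
    example_test "$@"
    exit $?
fi
```

Anything the script prints ends up in the module's log file, since the output is redirected as shown in the invocation above.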


Author: Dan Bradley