
Not infrequently the job crashes during run-down, and there is no event to be stored, so it stores trigger number -1 :-) instead. In such a case the job tends to crash 4 times, since removing the event doesn't change anything.
We have some bad processors. fncdf171 crashed 12 of the 38 runs submitted to it, a factor of 8-10 worse than the rest.
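As a quick check of that comparison, here is a minimal sketch of the arithmetic. Only the fncdf171 counts (12 crashes in 38 runs) and the factor of 8-10 come from the numbers above; the function name is hypothetical, and the baseline for the other nodes is derived from those figures rather than measured.

```python
def crash_rate(crashed, submitted):
    """Fraction of submitted runs that crashed."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    return crashed / submitted

# Counts quoted above for fncdf171.
bad_node_rate = crash_rate(12, 38)

# If this node is 8-10 times worse than the rest, the other nodes must sit
# in this range (an implied value, not a measurement).
implied_baseline = (bad_node_rate / 10, bad_node_rate / 8)

print(f"fncdf171 crash rate: {bad_node_rate:.1%}")
print(f"implied rate elsewhere: {implied_baseline[0]:.1%} to {implied_baseline[1]:.1%}")
```

So fncdf171 loses roughly a third of its runs, against an implied 3 to 4 percent on the other nodes.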
We have memory leaks. Several of them. This plot shows the progress of a Production run which crashed. You can easily see three different types of memory leaks: a slow, a medium, and a catastrophic.
If we look at a run that didn't crash, we can see that the medium memory leak is an occasional thing: large jumps, followed by only small growth for a while. Liz says this is a known feature of the KAI compiler. I guess it is easier and faster to grab more memory than to try to do garbage collection, and it grabs memory in really big chunks. Probably the GNU compiler is similar. D0 hacked the vector class in KAI to deal with this problem.
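One way to tell the occasional large jumps from the slow underlying growth is to look at the differences between consecutive memory readings. The sketch below is illustrative only: the function name, the 50 MB threshold, and the sample readings are all assumptions made up for the example, not values taken from the Production logs.

```python
def split_growth(samples_mb, jump_threshold_mb=50.0):
    """Separate total memory growth into large jumps and gradual growth.

    samples_mb: memory readings in MB taken at regular intervals.
    jump_threshold_mb: an increase larger than this between consecutive
        readings counts as a jump (arbitrary illustrative value).
    """
    jump_mb = 0.0
    gradual_mb = 0.0
    n_jumps = 0
    for prev, cur in zip(samples_mb, samples_mb[1:]):
        delta = cur - prev
        if delta > jump_threshold_mb:
            jump_mb += delta
            n_jumps += 1
        else:
            gradual_mb += delta
    return {"jump_mb": jump_mb, "gradual_mb": gradual_mb, "n_jumps": n_jumps}

# Made-up readings: slow growth punctuated by two large allocations.
readings = [200, 202, 204, 304, 306, 308, 458, 460]
result = split_growth(readings)
print(result)
```

In this invented series almost all of the growth comes from the two jumps, which is the pattern one would expect if a container grabs memory in big chunks rather than steadily leaking it.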
If we look at a single event we can see why we get catastrophic failures. This event crashed a Production job but runs successfully on its own. Presumably existing memory leaks left the job vulnerable to a greedy event like this. The log file reports that most of the time was spent in SiClusteringModule and SiPatternRecModule. Curiously, evd doesn't show any silicon hits at all.
An even more dramatic event shows memory use vs CPU time. According to the log file, 146 seconds of this event were spent in SiPatternRecModule. (See /cdf/scratch/cdfopr/testRel/5.1.1/TortureTest for the event.)
Modified 28-October-2003 at 10:30
http://hep.physics.wisc.edu/~jnb/imu/29Oct2003