(Last updated: July 19, 2009)
In designing the program I had already anticipated the possibility of hardware errors. The probability of encountering at least one hardware error grows with the length of a computation and approaches certainty for long enough runs. Therefore, even workstations (which are literally designed for this kind of abuse) are still prone to hardware errors during extremely long computations. Given the nature of computing constants, there is absolutely no tolerance for errors, since a single error can propagate through and completely ruin a computation. If hardware errors are common enough, large computations become infeasible.
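To put a rough number on how run length affects the odds, here is a small sketch. It assumes errors are independent and occur with a constant probability per hour; the function name and the rate in the usage loop are made up for illustration and are not measurements from any real machine.

```python
import math

def prob_at_least_one_error(p_per_hour: float, hours: float) -> float:
    """Chance of seeing at least one error in `hours` hours.

    Assumes independent errors with a constant per-hour probability
    0 <= p_per_hour < 1 (a simplifying assumption, not a hardware model).
    """
    # 1 - (1 - p)^hours, written with log1p/expm1 so tiny p stays accurate.
    return -math.expm1(hours * math.log1p(-p_per_hour))

if __name__ == "__main__":
    HYPOTHETICAL_RATE = 1e-5  # made-up per-hour error probability
    for hours in (1, 24, 24 * 30, 24 * 365):
        print(hours, prob_at_least_one_error(HYPOTHETICAL_RATE, hours))
```

Under these assumptions the chance of a clean run shrinks geometrically with run time, so a machine that looks perfectly reliable over an hour can still be a poor bet over weeks.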
As a result, y-cruncher was rigged with redundancy checks in the places where an error was most likely to occur. But the point is: I never really expected to actually encounter a hardware error - especially on a non-overclocked workstation. So I never took the next step, which was to incorporate error-correction. And this decision came back to bite me.
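This post does not say which redundancy checks y-cruncher uses. As an illustration only, one common kind of check for big-number multiplication is a modular check: the product reduced modulo a fixed prime must equal the product of the reduced operands. The names `CHECK_PRIME` and `checked_multiply` below are hypothetical, and the `multiply` parameter exists only so a faulty multiplier can be simulated.

```python
import operator

# Arbitrary modulus for the check: 2^61 - 1, a Mersenne prime.
CHECK_PRIME = (1 << 61) - 1

def checked_multiply(a: int, b: int, multiply=operator.mul) -> int:
    """Multiply a and b, then verify the result with a cheap modular check."""
    product = multiply(a, b)
    # (a * b) mod p must equal ((a mod p) * (b mod p)) mod p.
    expected = (a % CHECK_PRIME) * (b % CHECK_PRIME) % CHECK_PRIME
    if product % CHECK_PRIME != expected:
        raise ArithmeticError("redundancy check failed: product is corrupt")
    return product

if __name__ == "__main__":
    print(checked_multiply(3**200, 7**150) == 3**200 * 7**150)
    # Simulate a hardware fault by flipping one bit of the product.
    faulty = lambda x, y: (x * y) ^ (1 << 40)
    try:
        checked_multiply(3**200, 7**150, multiply=faulty)
    except ArithmeticError as exc:
        print("caught:", exc)
```

The check costs far less than the multiplication itself. It misses a corrupted product only when the corruption happens to be an exact multiple of the modulus, which is unlikely for a random fault.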
So when a hardware error did occur during the computation run, error-detection caught it and quit the program on the spot... That error was an easily recoverable one - had I implemented error-correction as well, the computation would likely have finished on its own with minimal impact on run-time.
Two lessons learned: ALWAYS monitor your hardware!!! And NEVER cut corners in programming... If it's real life, it'll come back to haunt you (as in this case). If it's a homework assignment, your professor will dock you... (as is the case with me waaaaayyy toooooo often...)
So I learn from my mistakes...
y-cruncher now has built-in error-detection AND error-correction. If an error occurs and it isn't in any sort of pointer arithmetic or flow control, there is a decent chance that y-cruncher will be able to detect it and recover from it. Furthermore, the program will be able to identify whether an error is hardware-related or software-related.
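The general idea of detecting an error, recovering from it, and guessing at its cause can be sketched as follows. This is not y-cruncher's implementation, which this post does not describe; `run_with_recovery`, `step`, and `verify` are hypothetical names. The sketch assumes each unit of work can simply be redone, and it uses a heuristic: a failure that goes away on a retry was probably a transient hardware fault, while one that repeats every time is probably a software bug.

```python
def run_with_recovery(step, verify, max_retries=3):
    """Run `step()` and re-run it whenever `verify(result)` rejects the result.

    Returns (result, diagnosis). Raises RuntimeError if every attempt fails.
    """
    failures = 0
    for _ in range(max_retries + 1):
        result = step()
        if verify(result):
            if failures == 0:
                return result, "clean"
            # The same work passed on a retry, so the fault was transient.
            return result, "recovered: transient error, likely hardware"
        failures += 1
    # Failing every time points to a deterministic cause.
    raise RuntimeError(
        "persistent error after %d attempts, likely software" % failures
    )

if __name__ == "__main__":
    attempts = []

    def flaky_step():
        # Simulated fault: the first attempt returns a wrong value.
        attempts.append(1)
        return 41 if len(attempts) == 1 else 42

    print(run_with_recovery(flaky_step, lambda r: r == 42))
```

A scheme like this only works when the check is independent of the faulty step and the step's inputs are still intact, which fits the caveat above about errors in pointer arithmetic or flow control.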
This will also allow the program to recover from minor programming bugs and continue running. (Though I'm very confident that the core arithmetic module of y-cruncher is free of bugs. The implementation has been carefully designed and rigorously tested, with all loose ends and special cases taken care of. The only thing that can possibly go wrong with the arithmetic module is round-off error from the floating-point FFTs - which are themselves tuned to very conservative settings... But just in case I missed something... y-cruncher should be able to recover from it, finish the computation, and even tell me where the bug is so that it can be fixed.)
In the future, I may also need to add an extra layer of ECC to all I/O operations since I've heard that hard drives are even more prone to errors...
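As a rough picture of what a protective layer on I/O could look like, the sketch below stores a CRC-32 checksum in front of each block and verifies it on the way back in. This is an assumption about one possible design, not a description of anything in y-cruncher, and `pack_block` and `unpack_block` are hypothetical names. Note that a CRC only detects corruption; a true ECC layer would also carry enough redundancy to repair it.

```python
import struct
import zlib

def pack_block(payload: bytes) -> bytes:
    """Prefix the payload with its CRC-32 (4 bytes, little-endian)."""
    return struct.pack("<I", zlib.crc32(payload)) + payload

def unpack_block(block: bytes) -> bytes:
    """Verify the stored CRC-32 and return the payload, or raise IOError."""
    if len(block) < 4:
        raise IOError("block too short to contain a checksum")
    (stored,) = struct.unpack("<I", block[:4])
    payload = block[4:]
    if zlib.crc32(payload) != stored:
        raise IOError("checksum mismatch: block was corrupted")
    return payload

if __name__ == "__main__":
    block = pack_block(b"digits of a constant")
    print(unpack_block(block))
    # Simulate a disk error by flipping one bit in the payload.
    damaged = bytearray(block)
    damaged[10] ^= 0x01
    try:
        unpack_block(bytes(damaged))
    except IOError as exc:
        print("caught:", exc)
```

With detection in place, a corrupted block can at least be re-read or recomputed instead of silently poisoning the rest of the computation.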
As a side note: Solid State Drives are out of the question because of the number of write cycles that y-cruncher will subject them to. Although hard disks have moving parts, the I/O patterns that y-cruncher exhibits are generally large sequential reads and writes (typically on the order of hundreds of megabytes) that involve very little head movement - so there is minimal mechanical wear and tear.