(Last updated: March 7, 2011)
Time to step back a bit... On some cheaper hardware...
On February 20, 2010, I successfully computed and verified the constant e to 500,000,000,000 decimal places.
The first 200 billion digits agree with Shigeru Kondo and Steve Pagliarulo's previous computation (May 2009).
7182818284 5904523536 0287471352 6624977572 4709369995 : 50
9574966967 6277240766 3035354759 4571382178 5251664274 : 100
. . .
8441770376 8631104974 6968465060 3035694339 6205930077 : 199,999,999,950
6894613668 1423327791 1840461713 7962025153 2608551312 : 200,000,000,000
8061629499 9889152937 5461889683 8218812430 6075311581 : 200,000,000,050
. . .
1263745760 3773674524 6963968004 5516655446 1704769964 : 499,999,999,950
7320021939 7617472530 4989685177 3671516511 2702955878 : 500,000,000,000
The digits are available for download here.
Computation Statistics - All times are Central Standard Time (CST).
It should be noted that the computer used was also one of my primary computers at the time. So unlike most size-record computations, the computer was actively used for normal everyday purposes throughout the entire computation. (web-browsing, gaming, music, Anime...)
Therefore, the timings are likely somewhat inflated compared to what a dedicated run would achieve.
Main Page: y-cruncher - The Multi-Threaded Pi-Program
Formulas used for computation
The following two formulas were used:
Taylor series of exp(1):

    exp(1) = sum_{k=0}^{inf} 1/k! = 1/0! + 1/1! + 1/2! + 1/3! + ...

Taylor series of exp(-1):

    exp(-1) = sum_{k=0}^{inf} (-1)^k/k! = 1/0! - 1/1! + 1/2! - 1/3! + ...
Standard Binary Splitting was used to sum up the series in an efficient manner. Because the formulas are so similar, the two series were summed up together in the same routine so that redundant parts of their computations could be re-used. (Namely their factorials.)
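The fused two-series summation is specific to the program, but plain binary splitting of the exp(1) series can be sketched in a few lines. This is a toy illustration under my own naming (`bs`, `e_digits`), not y-cruncher's actual code:

```python
import math

def bs(a, b):
    """Binary splitting: returns (P, Q) with
    sum_{k=a+1}^{b} a!/k! = P/Q  and  Q = b!/a!."""
    if b - a == 1:
        return 1, b
    m = (a + b) // 2
    p1, q1 = bs(a, m)
    p2, q2 = bs(m, b)
    # S(a,b) = S(a,m) + S(m,b) / Q(a,m)
    return p1 * q2 + p2, q1 * q2

def e_digits(digits):
    """e to `digits` decimal places (truncated) via the exp(1) series."""
    # Take enough terms that n! comfortably exceeds 10^digits.
    n = 10
    while math.lgamma(n + 1) / math.log(10) < digits + 10:
        n += 10
    p, q = bs(0, n)
    return (10 ** digits * (p + q)) // q   # e = 1 + P/Q

print(e_digits(50))
```

The key property is that the recursion keeps everything in exact integer arithmetic, so the big multiplications near the top of the tree can use fast FFT-based multiplication.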
At the end, two divisions were performed, producing the binary digits of e from each of the two formulas. Their results agreed.
Two base conversions using different cutting and scaling parameters were then done to obtain the decimal digits. Those also agreed.
The detailed computation statistics are as follows:
Fused Binary Splitting of the series     75.4 hours  (3.1 days)
Final Division: 1/exp(-1)                31.3 hours  (1.3 days)
Final Division: exp(1)                   31.9 hours  (1.3 days)
Write Hexadecimal Digits (Both)           6.4 hours
Construct Base Conversion Tables         17.5 hours  (0.7 days)
Base Convert: exp(1)                     67.7 hours  (2.8 days)
Write Decimal Digits: exp(1)             3.73 hours
Base Convert: 1/exp(-1)                  68.9 hours  (2.9 days)
Write Decimal Digits: 1/exp(-1)          3.73 hours
Total Time                                307 hours (12.8 days)
After completing the second division, there was a minor issue that kept the program from writing the digits to disk. (Windows apparently does not accept directory or file names that start or end with a space.) However, the results of both computations were still preserved on the swap disks. So all it took was a simple fix and a recompile to resume the computation.
No time was lost because the issue had been noticed well before it occurred, so the fix was prepared ahead of time in anticipation of the failure.
Since the computation was not done in a single continuous run, no screenshot is available for the computation. (Though it would've been nice to get a full screenshot with CPU-Z validation.)
A Small Note on Speed
The method that was used for computing and verifying is far from optimal. There were plenty of places where the program could be improved - especially the radix conversions. For example, performing two binary -> decimal conversions for verification is unnecessary; it is faster to convert to decimal once and then convert the result back to binary for comparison.
Alternatively, it is possible to verify a conversion by using modular checks over prime numbers. (As was done in the latest record for Pi.)
This completely avoids the need for a second conversion. (The program already does this to some extent, but it is not quite sufficient to qualify as a verification of the full conversion.)
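The modular-check idea can be illustrated as follows. This is a toy sketch (real implementations fold the residue computation into the conversion itself rather than re-reading the digit string):

```python
# Verify a binary -> decimal conversion with modular arithmetic
# instead of a second full conversion.
def digits_mod(digit_string, p):
    """Reduce a decimal digit string mod p by streaming (Horner's rule),
    without ever rebuilding the full integer."""
    r = 0
    for d in digit_string:
        r = (r * 10 + ord(d) - 48) % p
    return r

def check_conversion(n, digit_string, primes=(2**61 - 1, 2**31 - 1)):
    """Return True if digit_string is plausibly the decimal form of n.
    A mismatch proves an error; agreement over several large primes
    makes an undetected error astronomically unlikely."""
    return all(n % p == digits_mod(digit_string, p) for p in primes)

n = 3 ** 1000                               # stand-in for a big binary result
assert check_conversion(n, str(n))          # correct conversion passes
assert not check_conversion(n, str(n + 1))  # a corrupted digit is caught
```

The check is linear in the number of digits with tiny constants, so it costs almost nothing next to the conversion it verifies.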
Because e is a fast constant to compute (compared to other constants), the divisions and the radix conversions dominate the total run-time. However, neither of these operations was particularly well-optimized, since the program was originally written for much slower constants (e.g. the Euler-Mascheroni Constant) where divisions and conversions are negligible relative to the total run-time.
Nevertheless, the program was efficient enough to perform the computation on a desktop within a reasonable amount of time.
However, letting it drag on for more than 10 days was pretty painful since it tied down my primary entertainment computer, which I use for video-intensive gaming as well as Anime. (And for reasons that will be explained later, my workstation could not be used for this computation.)
Not much here. Most of the arithmetic algorithms have been largely unchanged since last year's computation of the Euler-Mascheroni Constant; virtually all the math has been untouched, and nearly all improvements to the program since then were made at the implementation level.
The only thing worth mentioning is that large multiplications on disk were done using the standard 3-step and 5-step convolution algorithms - which are the best known approaches for minimizing disk access. But even then, disk products are still highly limited by bandwidth. So for large products, every effort was made to try fitting them into memory. In such cases, the program will attempt in-place FFTs, CRT convolution reductions, and even go as far as sacrificing parallelism to fit a particular product in memory. Only when all else fails does it use 3-step convolution on disk.
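As a rough illustration of the 3-step decomposition: a length-N transform is split into column FFTs, a twiddle-factor multiply, and row FFTs, so each pass streams through the data once (which is what makes it disk-friendly). The sketch below uses NumPy's in-memory FFT purely for demonstration; the real disk version runs each pass out of core:

```python
import numpy as np

def fft_3step(x, rows, cols):
    """Compute a length rows*cols FFT via Bailey's 3-step algorithm:
    column FFTs, twiddle multiply, row FFTs (plus a final transpose)."""
    a = np.asarray(x, dtype=complex).reshape(rows, cols)
    a = np.fft.fft(a, axis=0)                 # step 1: FFTs down the columns
    r = np.arange(rows).reshape(rows, 1)
    c = np.arange(cols).reshape(1, cols)
    a *= np.exp(-2j * np.pi * r * c / (rows * cols))  # step 2: twiddles
    a = np.fft.fft(a, axis=1)                 # step 3: FFTs across the rows
    return a.T.reshape(-1)                    # transpose -> natural order

x = np.random.rand(4096)
assert np.allclose(fft_3step(x, 64, 64), np.fft.fft(x))
```

On disk, steps 1 and 3 each become one sequential sweep over the array (with transposes absorbed into the access pattern), which is why sequential disk bandwidth is the limiting factor.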
With such a headache involving disk access, it only made sense that the existing multiple hard drive support (in Basic Swap Mode) was extended to include the large swap computations (Advanced Swap Mode). And as such, a software raid layer was built into the program that allows dynamic striping of an unlimited number of hard drives. The dynamic striping provides the flexibility of allowing the program to choose its own stripe-sizes and freely switch back and forth between interleaved striping and append striping. Although currently not used, this will allow the program to optimally select the trade-off between fast sequential access (interleaved striping) vs. fast strided access (append striping).
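The two striping layouts described above amount to two different mappings from a logical offset to a (drive, local offset) pair. A hypothetical sketch (function names and the address-mapping details are my own, not the program's):

```python
def interleaved(offset, num_drives, stripe_size):
    """Interleaved (RAID-0 style) striping: consecutive stripes rotate
    across drives, so one sequential stream hits all drives at once."""
    stripe, within = divmod(offset, stripe_size)
    drive = stripe % num_drives
    local = (stripe // num_drives) * stripe_size + within
    return drive, local

def append(offset, num_drives, region_size):
    """Append striping: each drive holds one large contiguous region,
    so widely strided accesses can land on different drives."""
    drive, local = divmod(offset, region_size)
    assert drive < num_drives, "offset past the end of the array"
    return drive, local

# With 4 drives and 1 MiB stripes, 4 consecutive stripes touch all 4 drives:
MiB = 1 << 20
assert [interleaved(i * MiB, 4, MiB)[0] for i in range(4)] == [0, 1, 2, 3]
# Append striping keeps the same range on a single drive:
assert append(3 * MiB, 4, 1024 * MiB) == (0, 3 * MiB)
```

The trade-off falls out of the mapping: interleaving maximizes bandwidth for one sequential pass, while append striping lets several widely separated access streams proceed on independent drives.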
For this particular computation, only 4 hard drives were used - which could have easily been combined using hardware Raid 0. But in the interest of testing the program's multi-hard drive feature, the 4 hard drives were kept as separate logical volumes.
In addition to ramping up disk bandwidth by cramming in more hard drives, overclocking was used to speed up the computer and further increase its computing potential.
With 4 hard drives and nearly 500 MB/s of disk bandwidth, swap computations even at this size are no longer dominated by disk access.
In fact, more than half of the actual run-time is limited by CPU and memory rather than by disk. Therefore, overclocking is a worthwhile option for further increasing performance.
For this type of application, a lot of effort is usually put into squeezing out every last percent of performance from the software.
However, many fail to realize that the hardware can be a great opportunity for improvement as well. With the right tools and skills, it is usually possible to squeeze a generous amount of extra performance out of the hardware.
Problems with Overclocking
The greatest concern of overclocking is of course stability and reliability. An improperly overclocked system can randomly crash or produce errors when under heavy load with intense heat and stress. Even a seemingly stable system that passes all stress tests such as Prime95 and Linpack may go unstable after days of heavy load and extreme stress. There may be a variety of reasons such as higher ambient temperatures, vdroop, or simply a degrading processor...
To make matters worse, this type of computation does not tolerate errors of any kind. Even though there is error-correction built into the program itself, it is best to not rely on it since most instability will usually result in a BSOD or a system crash of some sort - which the program cannot recover from.
However, overclocking, when done properly, can be as stable as a non-overclocked system. It can therefore be an effective way to increase the performance of a system without any expensive hardware upgrades. It does, however, void the hardware warranties. But rarely is that a problem since most computer hardware will last much longer than its warranty when given enough care.
First we must understand why overclocking is possible. Why would Intel or AMD sell a processor at xx GHz if it is capable of clocking higher? (Excluding marketing and economic reasons.)
There are many reasons for this - many of which are obviously proprietary. But one of the more common reasons is simply heat.
Retail processors are often sold with a set guarantee on their life expectancy. However, the MTTF (Mean Time To Failure) of the transistors that reside in them is exponentially related to the sustained operating temperature. The relationship is given by Black's equation for electromigration:
    MTTF = (A * w / j^2) * exp(Qa / (k * T))

where:
A = Constant
w = Width (of the interconnect)
j = Current Density
Qa = Activation energy of the material
k = Boltzmann Constant
T = Absolute Temperature
So every degree increase in sustained operating temperature can drastically reduce the life of the processor. And conversely, reducing the sustained operating temperature can exponentially increase the life of the processor.
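To put a rough number on the exponential sensitivity, here is the temperature-dependent factor of the equation in isolation. The activation energy and temperatures below are illustrative assumptions, not measured values for any real processor:

```python
import math

def mttf_ratio(t1_celsius, t2_celsius, qa_ev=0.7):
    """Relative change in MTTF from raising the sustained operating
    temperature from t1 to t2, per the exp(Qa / kT) term of Black's
    equation (current density held constant). qa_ev is an assumed
    activation energy in electron-volts."""
    k = 8.617e-5  # Boltzmann constant in eV/K
    t1 = t1_celsius + 273.15
    t2 = t2_celsius + 273.15
    return math.exp(qa_ev / (k * t2)) / math.exp(qa_ev / (k * t1))

# Under these assumed parameters, going from a sustained 65C to 80C
# cuts the expected lifetime to roughly a third:
print(mttf_ratio(65, 80))
```

The exact numbers depend heavily on the material parameters, but the exponential shape is why a modest temperature increase matters so much more than the frequency bump itself.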
Overclocking a processor by increasing its frequency and voltage obviously increases its power consumption and heat production. This in effect decreases the life of the processor. So the processor will (usually) be able to operate at higher frequencies with absolute stability, but it simply won't last as long. (Though this reasoning is highly simplified since there are many other factors that come into play.)
The simple solution is to find a better way to dissipate the extra heat produced from overclocking. Retail processors are often sold with heatsinks that are "barely enough" to keep the processor from overheating. However, there are numerous 3rd-party heatsinks and coolers that are of much higher quality and can dissipate much more heat. These can range from tower heatsinks with heat-pipes, to water cooling, to phase-change, to Liquid Nitrogen. (although the latter is not practical for extended use)
By investing in some of these after-market coolers, it is possible to reduce the temperature of the processor enough that it can be safely overclocked with little or no adverse effects.
I should note that heat is obviously not the only problem in overclocking a computer. But it is presented here as a simple example of how a barrier to overclocking can be taken down with a little investment and tweaking.
Using standard overclocking techniques, the computer was easily overclocked to 3.5 GHz with no increases in voltages and only slight changes in memory timings. The processor was then further pushed up to 3.8 GHz - at which all the standard stability tests were run. (Prime95, Linpack, etc...)
With only slight bumps in voltages and slightly relaxed memory timings, the system was able to pass all stability tests at 3.8 GHz for several hours. In particular, all stress tests except for Linpack were able to run for several hours without error at 3.9 GHz or higher.
To ensure stability, the entire system was then clocked back down slightly (by lowering the base clock) so that the processor ran at 3.5 GHz. All other settings (including voltages) were kept constant. This effectively provided a 300 MHz safety margin to ensure stability.
Overclocking the memory from 1066 MHz to 1333 MHz followed a similar process, except that very little safety margin was used since the memory was "rated for overclock @ 1333 MHz under the right conditions". The uncore frequency was also overclocked (from 2.17 GHz to 2.83 GHz) using the same standard overclocking techniques.
It should be noted that this overclock was not done solely for the purpose of this computation. The computer had always been running at 3.5 GHz (or higher) since the second day it was built (July 2009). Since then, the computer has been put under extremely stressful workloads for months at a time at 3.5 GHz and 3.8 GHz. Not once has a single error or crash occurred at 3.5 GHz. And only once did 3.8 GHz fail (resulting in a BSOD), but that was during the first week of testing, before any safety margins were applied.
3.5 GHz was chosen as the frequency to use because the minimum fan speed that was needed to keep 3.8 GHz from overheating was too loud for my roommate and I to handle for more than a few hours... (For 3.8 GHz to be stable with a 300 MHz safety margin, a significant increase in voltage was needed to allow the system to pass all stress tests at 4.1 GHz...)
There have been multiple instances where the cpu was clocked to 4.2 GHz and higher, but none of those sessions were sustained for more than a few hours due to overheating issues. Hence, the processor is clearly capable of much higher frequencies if overheating wasn't a problem.
The final frequency was set at 3.33 GHz (3.5 GHz turbo boost) - which is a 25% increase from the default frequency of 2.66 GHz (2.80 GHz turbo boost).
This effectively made the processor run slightly faster than the much more expensive Core i7 Extreme 975 - which runs at 3.33 GHz (3.46 GHz turbo boost).
At 3.5 GHz, the load temperature of the hottest core ranged from 70C - 76C in silent mode (all fans at minimum) and 65C - 70C in normal mode. These corresponded to ambient temperatures of roughly 73F - 80F.
With all fans set to maximum speed, the processor is capable of sustaining 4.2 GHz. Under load with an ambient temperature of 80F, the hottest core will reach 80C - 83C - which is also near the point where the processor begins to throttle down the turbo boost. If water cooling is used, it might be possible to achieve stability at upwards of 4.5 GHz with this particular processor. (Though the voltages that would be required to sustain that would likely be unsafe for the processor...)
Despite all the drawbacks of overclocking, I find that debugging hardware instability is no different from debugging multi-threaded code.
The original plan was to attempt 1 trillion digits using the same workstation that was used in last year's computations.
That workstation has:
2 x Intel Xeon X5482 @ 3.2 GHz (8 physical cores)
64 GB ram
750 GB + 4 x 1 TB HDs with 400MB/s total disk bandwidth.
But technical difficulties, along with being far away from home (and having no physical access to the machine), meant that I could not use my workstation for such a computation until at least Spring Break. So the only other option was to use the next best computer I had: my overclocked desktop.
(Originally that workstation was in my dorm room in Chicago. But after the school year ended, I moved it home to California. Since then, I've been remote controlling it.)
As stated near the top of this page, my desktop has:
Intel Core i7 @ 3.5 GHz (4 physical cores)
12 GB ram
1.5 TB + 1 TB HDs
Which is, of course, a much less capable machine...
Since this desktop was my primary programming box and entertainment center, it would have been quite problematic to tie it down for an extended period of time.
(This is in contrast to last year, back when my workstation was still with me in Chicago - when it was in total excess and had absolutely no other purpose besides gaming and bragging rights.)
So in the end, I decided to grab a few extra hard drives and go with a smaller computation - using my little desktop...
Questions or Comments
Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.