12.1 Trillion Digits of Pi

And we're out of disk space...

 

By Alexander J. Yee & Shigeru Kondo

 

(Last updated: December 28, 2013)

 

 

It's been 2 years since our last computation and it didn't look like anyone else was attempting to beat it. So we thought we'd see how much better we could do with improved software and two extra years of Moore's Law on computer hardware.

 

We didn't go for 20 trillion digits because of the massive space requirements. Instead we went for a "small and fast" computation of only 12.1 trillion digits.

3.
1415926535 8979323846 2643383279 5028841971 6939937510 : 50
5820974944 5923078164 0628620899 8628034825 3421170679 : 100
2597691971 6538537682 7963082950 0909387733 3987211875 : 4,999,999,999,950
6399906735 0873400641 7497120374 4023826421 9484283852 : 5,000,000,000,000
0874753286 1800417105 0417234177 3440660835 6303291247 : 5,000,000,000,050
0494788787 3350953767 0283816394 7603993291 8259328455 : 5,000,000,000,100
9544408882 6291921295 9268257225 1615742394 7483010753 : 9,999,999,999,950
9804871001 5982157822 2070871138 6966940952 1989228675 : 10,000,000,000,000
4392476662 7656619000 2124460557 5531593458 4820611421 : 10,000,000,000,050
7774158802 8607364081 0882791485 3424578359 2017915623 : 10,999,999,999,950
9022892022 4547999448 8759918307 1538211154 9129857424 : 11,000,000,000,000
8850838032 0621312483 8327044318 1257233570 9958940293 : 11,000,000,000,050
0225990312 0715120901 2625933350 2113726804 5410547637 : 11,999,999,999,950
9229203002 7190437071 4461941152 9088219512 2780551965 : 12,000,000,000,000
6281443745 8325315851 4346900240 7029751021 5197388736 : 12,000,000,000,050
2323632519 7260136778 4244176022 7323590892 5620919959 : 12,099,999,999,950
6850993485 4514351158 7381100302 4015062599 1743930275 : 12,100,000,000,000
2840983997 4140672353 4319030217 7521257272 1740258075 : 12,100,000,000,050

 

More digits can be found here: http://piworld.calico.jp/estart.html

 

Computation Statistics - All times are Japan Standard Time (JST).

 

Unlike the grueling 371 days that was needed for our previous computation of 10 trilllion digits, this time we were able to beat it in about 94 days.

 

The computation was interrupted 5 times - all due to hard drive errors. Furthermore, a bug was discovered in the software which had to be fixed before the computation could be continued.

 

Memory Statistics:

TiB = 240 bytes

 

 

Computation Statistics:

 

Verification Statistics:

 

 

Computation Step Time

Series Summation:

75 days

Division:

59 hours

Square Root:

38 hours

Final Multiply:

32 hours

Verify Multiply:

3.74 hours

Base Conversion:

123 hours

Verify Base Conversion:

36 hours

Output Digits:

56 hours
Time Lost to HW Failures: 12 days
Total Computation Time: 82 days
Total Real Time: 94 days

 

 

Hardware

As with the previous computations, Shigeru Kondo built and maintained the hardware. This time he had a new machine based on the Intel Sandy Bridge processor line.

Processor:

2 x Intel Xeon E5-2690 @ 2.9 GHz - (16 physical cores, 32 hyperthreaded)

Memory:

128 GB DDR3 @ 1600 MHz - 8 x 16 GB - 8 channels)

Motherboard:

Asus Z9PE-D8

Hard Drives:

Boot: 1 TB

Pi Output: 4 x 3 TB

Computation: 24 x 3 TB

Backup: 4 x 3 TB

Operating System:

Windows Server 2012 x64

 

And of course the obligatory pictures... (click to enlarge)

       
Close up of the motherboard. SATA Controllers Hard Drive Rack Hard Drive Fans

 

 

 

Methods and Improvements

The math and overall methods were the same as our previous computations:

The hexadecimal digits of the main computation matched that of the BBP formula:

The main computation was done using a new version of y-cruncher (version v0.6.2). The verification was done using the same BBP program from before.

 

Our previous computation of 10 trillion digits required 371 days. This time we were able to achieve 12.1 trillion digits in "only" 94 days. So it comes out to about 5x improvement in computational speed - which is a lot for only 2 years... (way ahead of Moore's Law)

 

The reason for this was a combination of several things:

 

Improved Fault-Tolerance:

 

Fault-tolerance is the most obvious improvement. A factor of 2 comes from just eliminating the impact of hard drive failures.

 

The key is the checkpoint-restart. This is the ability to save the state of the computation so that it can be resumed later if the computation gets interrupted.

y-cruncher has had checkpoint-restart capability since v0.5.4 where it was added specifically for our 5 trillion digit computation. But this original implementation turned out to be horrifically insufficient for the 10 trillion digit run.

 

The problem was that it only supported a linear computational model and thus restricted a computation a small number of checkpoints. The longer the computation, the longer the time between the checkpoints. In the 10 trillion computation, some of the checkpoints were as spread out as 6 weeks apart.

Meaning that: In order to make progress, the program had to run uninterrupted for up to 6 weeks at a time.

 

While it's easy to keep a computer running for 6 weeks, the MTBF of the 24 hard drive array was only about 2 weeks. And when the MTBF becomes shorter than the work unit, bad things happen. You basically get a downward spiral in efficiency. And this hit us hard: Several of the largest work units required many attempts to get through. Ultimately, it was basically a disaster since it nearly doubled the time needed to compute 10 trillion digits of Pi.

 

y-cruncher v0.6.2 solves this problem by completely rewriting and redesigning the checkpoint-restart system to allow for finer-grained checkpointing. This was done by adding the ability to save and restore the execution stack. The result is that those 6-week long gaps were reduced to a mere hours - or at most a few days. This boost in reliability enabled us to sail through the 12.1 trillion digit computation without too many issues.

 

In this computation, there were 5 hard drive failures - 4 of which wasted very little computation time. The 5th and last failure happened during the verification of the radix conversion. The radix conversion lacks checkpoints due to the in-place nature of the algorithm. So this final failure added about a week to the time that was lost.

 

There were a total of about 12 days with no useful computational work. This includes the time lost to hard drive failures as well as the time used to perform preemptive backups of the checkpoints.

 

 

Performance Optimizations:

 

If we assume that the fault-tolerance accounts for 2x of the 5x improvement, then the remaining 2.5x is from performance improvements.

The new version of y-cruncher adds a lot of performance enhancements. But it's difficult to quantize it since we also had better hardware.

 

A partial list of the improvments in y-cruncher v0.6.2 are:

The most interesting and unexpected of these is likely the last one. So that's the only one we will go into detail.

 

What did it do? It basically halved the amount of disk I/O needed to perform multiplications larger than 1 trillion digits. This improvement was completely unexpected and was a shocking (and pleasant) surprise to see for the first time. At these large sizes, the computation is largely bound by disk I/O. So any improvements to disk access translates directly to performance gain.

 

y-cruncher has always had a built-in software RAID. But version 0.6.1 does away with the old implementation and replaces it with a new (and overly elaborate) one that has native support for gather and scatter operations. (among other improvements)

 

The out-of-core FFT algorithm for large multiplication has non-sequential disk access. Normally, this algorithm can do a forward and inverse transform using only 3 passes over the dataset. This is known as the "3-step" algorithm. However, the larger the FFT, the more non-sequential the disk access becomes. At some point, the access disk access becomes so non-sequential that the run-time becomes dominated by seeks rather than reads or writes. This is a run-away effect. The number of disk seeks increases quadratically with the size of the FFT. So at some point, it becomes better to switch to the "5-step" algorithm which has fewer disk seeks, but needs nearly double the amount of disk access (in bytes).

 

In our 5 and 10 trillion digit computations, the threshold where we switched to the 5-step algorithm was approximately 1 trillion digits. So all multiplications above 1 trillion digits had to pay a penalty of 2x in the disk I/O. In the 12.1 trillion digit computation, this threshold was increased to 10 trillion. How did that happen?

Based on emperical evidence, this new approach reduces the cost of seeks by about factor of 2 - 4. The improvement increases with the number of drives.

Because the cost of seeking had dropped so dramatically, it allowed the program to be more aggressive with using the 3-step algorithm. This ultimately increased the cross-over threshold of the 3-step and 5-step algorithms to 10 trillion digits.

 

Nevertheless, this speedup didn't come without a cost. The new RAID system was one of the most difficult and time-consuming components to implement and test.

 

 

New Difficulties

Our 5 trillion digit computation had virtually no issues. There was one CPU or memory error early on. But nothing big.

But our 10 trillion digit computation was plagued with hard drive failures. These were so bad that it nearly doubled the time needed for the computation.

 

For this 12.1 trillion digit computation, we fixed or worked around all of these issues. But when you get rid of existing bottlenecks, you expose new ones...

This time, the new bottleneck is disk capacity.

 

A 20 trillion digit computation of Pi would need a lot of disk space. Using y-cruncher v0.6.3, it would require:

Even with the largest drives available (4 TB), it would have required a prohibitively large number of drives. While it's theoretically possible to have over 100 drives hooked up to a single motherboard, there is a point where it becomes... insane - even if you forget about cost. And if the MTBF drops below a few days, then you'll run into the same run-away time loss from hard drive failures. Getting around this would require stacking even more redundancy which further increases the capacity requirement. In any case, there's a limit somewhere.

 

It's possible to decrease the memory requirement at the cost of performance. But it cannot be reduced by more than a factor of two before the performance drop becomes a run-away effect. So beyond that, new math will be needed to develop algorithms that use even less memory.

 

In other words: Computing Pi on commodity hardware is nearing the point where it is at the mercy of Moore's Law for storage capacity.

But there's not much that can be done about storage (other than to get more of it). Random binary data doesn't compress very well either...

 

 

 

Detailed Timeline

 

Date Event Screenshot
September 25, 2013

Start of Computation

View
October 1, 2013

Series summation is 10% complete.

 
October 2, 2013

Week 1: Series summation is past 10% complete.

View
October 3, 2013

Series summation is 13% complete.

 
October 5, 2013

Series summation is 16% complete.

 
October 8, 2013

Series summation is 21% complete.

 
October 9, 2013

Week 2: Series summation is past 21% complete.

View
October 12, 2013

Series summation is 27% complete.

 
October 16, 2013

Week 3: Series summation is past 27% complete.

View
October 17, 2013

Series summation is 34% complete.

 
October 23, 2013

Week 4: Series summation is past 34% complete.

View
October 24, 2013

Series summation is 44% complete.

 
October 30, 2013

Week 5: Series summation is past 44% complete.

View
November 1, 2013

Series summation is 56% complete.

 
November 6, 2013

Week 6: Series summation is past 56% complete.

View
November 12, 2013

Series summation is 71% complete.

 
November 13, 2013

Week 7: Series summation is past 71% complete.

View
November 20, 2013

Week 8: Series summation is still past 71% complete.

View
November 23, 2013

Series summation is 85% complete.

 
November 24, 2013

Hard Drive Error: Read error followed by a spontaneous dismount of the hard drive.

A bug is discovered in y-cruncher's checkpoint resuming function.

So a patch had to be sent to Shigeru Kondo to properly resume the computation.

View
November 27, 2013

Week 9: Series summation is past 85% complete.

View
December 3, 2013

Hard Drive Error: Spontaneous dismount of a hard drive. (the same hard drive as before)

 
December 4, 2013

Week 10: Series summation is still past 85% complete.

View
December 5, 2013

Hard Drive Error: Read error on the same hard drive. Spontaneous dismount of a different hard drive.

Both hard drives are replaced and a backup of the computation is made before resuming.

 
December 9, 2013

Series summation is complete.

 
December 10, 2013

Division is complete.

 
December 11, 2013

Week 10: Performing Square Root.

View
December 14, 2013

Day 80: Computation of binary digits is complete. Results match BBP formula.

View
December 17, 2013

Day 83: Base conversion is in progress.

View
December 18, 2013

Week 12: Still performing base conversion.

View
December 20, 2013

The conversion is complete and the digits are written to disk.

The digits match the 10 trillion digit computation.

 

Hard Drive Error: Read error.

The hard drive is replaced. The computation is rolled back to the start of the base conversion.

 
December 25, 2013

Week 12: Still performing base conversion.

View
December 28, 2013

Day 94: Computation is Complete

View

 

 

 

Development Environment

 

For those curious of the programming environment that was used to develop the y-cruncher software... (if can even be considered a "programming environment")

It didn't always look exactly like this... The computers change regularly. And the monitors weren't as big when I was still in school.

 

But what did stay constant was the large variety of computers that were used to develop and tune the y-cruncher program (among other things). This variety is necessary for developing code that performs well on different hardware and is also a side-effect of building new machines and never tossing out the old ones.

 

As of this writing, there are 6 active machines - 3 of which are in the pictures above:

Processor(s) Generation Memory Primary Storage Swap Configuration
Core i7 920 3.5 GHz (OC) Intel Nehalem 12 GB DDR3 1333 MHz 2 TB -
Core i7 3630QM 2.4 GHz Intel Ivy Bridge 8 GB DDR3 1600 MHz 1 TB -
FX-8350 4.2 GHz (OC) AMD Piledriver 16 GB DDR3 1333 MHz 2 TB -
Core i7 4770K 4.0 GHz (OC) Intel Haswell 32 GB DDR3 1866 MHz 1.5 TB 4 x 1 TB + 4 x 2 TB
4 x Opteron 8356 2.31 GHz AMD Barcelona 8 GB DDR2 533 MHz 80 GB + 1 TB -
2 x Xeon X5482 3.2 GHz Intel Harpertown 64 GB DDR2 800 MHz 64 GB SSD + 2 TB 8 x 2 TB

While these aren't as powerful as the computer we used for the computation, most of them are at least mainstream enthusiast hardware. There used to be more of them, but most are retired. One was handed down to a relative. One was left behind with a friend in Illinois when I left school. And one was recently destroyed due to a careless short circuit (a Core i7 2600K @ 4.6 GHz).

 


Questions or Comments

Contact us by e-mail.

 

You can also find me on: