y-cruncher - Language and Algorithms
By Alexander J. Yee
June 14, 2013
Source Code (as of v0.6.2):
- 95% of y-cruncher is written in C99. 5% of y-cruncher requires C++11. The entire program is compiled as C++11.
- Some inline assembly is used.
- Intel SSE and AVX compiler intrinsics are used.
- As of June 14, 2013, y-cruncher has about 276,000 lines of code for a total size of approximately 9.5 MB.
- y-cruncher is closed source.
Libraries and Dependencies:
- Windows: WinAPI
- Linux: POSIX
- y-cruncher has no other 3rd party dependencies.
- All Windows binaries are compiled using Microsoft Visual Studio 2012.
- All Linux binaries are compiled using GCC.
Other Internal Requirements:
- Little-endian addressing
- Sign-fill (arithmetic) right-shift
- IEEE double-precision floating-point
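None of these is guaranteed by the C/C++ standards, so they have to hold on the target platform. As a minimal sketch (not taken from y-cruncher) of what each requirement means in C++11:

    #include <cstdint>
    #include <cstring>
    #include <limits>

    //  Minimal illustration of the three requirements above. (Not y-cruncher code.)

    //  IEEE double-precision floating-point:
    static_assert(std::numeric_limits<double>::is_iec559,
                  "IEEE-754 double-precision is required.");

    //  Sign-fill (arithmetic) right-shift - implementation-defined in C/C++,
    //  so it must be verified for the target compiler:
    static_assert((std::int64_t(-2) >> 1) == std::int64_t(-1),
                  "Right-shift of signed integers must be arithmetic.");

    //  Little-endian addressing - checked at run-time, since C++11 has no
    //  portable compile-time endianness test:
    inline bool is_little_endian(){
        std::uint32_t x = 1;
        std::uint8_t first_byte;
        std::memcpy(&first_byte, &x, 1);
        return first_byte == 1;
    }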
Program Structure:
- y-cruncher's design is split into 3 major layers. There's an additional interface layer at the very top, but that's not as interesting.
- Math Layer: This is where all the constants and functions are implemented. This is done largely with very simple object-oriented programming, structured in a way that closely mirrors the mathematical formulas and algorithms involved.
Roughly 20% of the source code is in this layer and is written in C++11.
- Object Layer: This is where all the "number" objects are defined. (Large integer, large floating-point, etc...)
This layer most resembles a C++/Java class library but without any fancy stuff like inheritance. (everything compiles in C99)
Roughly 2% of the source code is in this layer.
- Modules Layer: Here's where it gets serious. More than 70% of the entire source code of y-cruncher resides here. The Modules layer implements all the core operations needed by the upper layers. This is where most of the performance critical code is. All that nasty stuff like FFTs, CPU-dispatching, loop-unrolling, etc... is all down here. Unlike the Math and Object layers, there is no structure in the Modules layer. It's exactly as the name implies: A collection of (mostly) independent modules that implement different low-level operations needed by y-cruncher.
Together, these modules implement an interface of about 100 or so different functions. As a normal part of development, new modules are constantly being added and old ones phased out and retired. So at any one time, there are typically multiple modules that implement the same function(s). To prevent naming conflicts, name-mangling is used on all symbols and resolved by a compile-time version dispatcher.
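To give a rough idea of how that can look (this is an invented sketch - the real module names, macros, and dispatcher in y-cruncher are not public):

    #include <cstddef>

    //  Hypothetical sketch of name-mangled modules resolved by a compile-time
    //  version dispatcher. Everything here (names, macro, function) is invented
    //  for illustration only.
    namespace Module_FFT_v17_SSE3 {
        void multiply(double* T, std::size_t L){ /* ... SSE3 implementation ... */ }
    }
    namespace Module_FFT_v20_AVX {
        void multiply(double* T, std::size_t L){ /* ... AVX implementation ... */ }
    }

    //  Each build selects exactly one implementation per function at compile-time.
    //  Retired modules can stay in the source tree without naming conflicts because
    //  every symbol carries its version and instruction set in its name.
    #ifdef BUILD_WITH_AVX
        namespace Active_FFT = Module_FFT_v20_AVX;
    #else
        namespace Active_FFT = Module_FFT_v17_SSE3;
    #endif

    void some_higher_level_operation(double* T, std::size_t L){
        Active_FFT::multiply(T, L);    //  resolved at compile-time, no runtime dispatch
    }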
y-cruncher has two algorithms for each major constant that it can compute - one for computation, and one for verification.
All complexities shown assume multiplication to be O(n log(n)). It is slightly higher than that, but for all practical purposes, O(n log(n)) is close enough.
- Square Root of n and Golden Ratio
- 1st order Newton's Method - Runtime Complexity: O(n log(n))
Note that the final radix conversion from binary to decimal has a complexity of O(n log(n)^2).
- e
- Taylor Series of exp(1): O(n log(n)^2)
- Taylor Series of exp(-1): O(n log(n)^2)
- Pi
- Chudnovsky Formula: O(n log(n)^3)
- Ramanujan's Formula: O(n log(n)^3)
- Log(n)
- Machin-Like Formulas: O(n log(n)^3)
Starting from v0.6.1, y-cruncher is able to generate and utilize machin-like formulas for the log of any small integer.
- Zeta(3) - Apery's Constant
- Amdeberhan-Zeilberger Formula 2: O(n log(n)^3)
- Amdeberhan-Zeilberger Formula 1: O(n log(n)^3)
- Catalan's Constant
- Lupas Formula: O(n log(n)^3)
- Huvent's Formula (v0.6.2 and later): O(n log(n)^3)
- Ramanujan's Formula (v0.5.5 and earlier): O(n log(n)^3)
- Euler-Mascheroni Constant
- Brent-McMillan with Refinement: O(n log(n)^3)
- Brent-McMillan (alone): O(n log(n)^3)
- Note that both of these formulas are essentially the same.
Therefore, in order for two computations to be independent enough to qualify as a verified record, they MUST be done using different n.
For simplicity of implementation, y-cruncher only uses n that are powers of two - which serves a dual purpose of allowing the use of (the easily computed) Log(2), as well as lending itself to shift optimizations.
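For reference, the unrefined Brent-McMillan approximation has the following textbook form (this is the standard statement of the formula, not a description of y-cruncher's internal arrangement of it):

    \gamma \;\approx\; \frac{A(n)}{B(n)} - \ln n,
    \qquad
    A(n) = \sum_{k=0}^{\infty} \frac{n^{2k}}{(k!)^2}\, H_k,
    \quad
    B(n) = \sum_{k=0}^{\infty} \frac{n^{2k}}{(k!)^2},
    \quad
    H_k = \sum_{j=1}^{k} \frac{1}{j}

with an error that decays exponentially in n (roughly like e^{-4n}). The free parameter n is what the two runs must choose differently, and picking n as a power of two turns the ln(n) term into a simple multiple of Log(2) - the shortcut mentioned above.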
Multiplication:
y-cruncher's set of algorithms for multiplication is pretty standard at the low-end, but unconventional at the high-end.
Standard libraries like GMP avoid the use of floating-point FFTs due to the difficulty of proving bounds on round-off error. But y-cruncher fully exploits floating-point algorithms and ensures their accuracy by means of conservative parameter selection and algorithmic error-detection. Although this is not proven to be correct for all inputs, the probability that y-cruncher's multiplication returns an undetected incorrect value is less than the probability of a silent hardware error.
So in that sense, y-cruncher is arguably more reliable than other implementations that lack algorithmic error-detection.
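As a generic illustration of what algorithmic error-detection for a floating-point FFT can look like (this is not y-cruncher's actual check - just the common idea of measuring how far the convolution outputs drift from integers):

    #include <cmath>
    #include <cstddef>
    #include <cstdint>

    //  After the inverse transform, every convolution coefficient must be an
    //  integer. The worst observed distance to the nearest integer is therefore
    //  a direct measurement of the accumulated round-off error.
    bool round_and_check(const double* raw, std::uint64_t* out, std::size_t length,
                         double threshold = 0.25){
        double worst = 0.0;
        for (std::size_t i = 0; i < length; i++){
            double rounded = std::floor(raw[i] + 0.5);
            double error   = std::fabs(raw[i] - rounded);
            if (error > worst)
                worst = error;
            out[i] = (std::uint64_t)rounded;
        }
        //  If the worst error gets anywhere near 0.5, the product cannot be
        //  trusted. A conservative implementation would then redo the multiply
        //  with safer parameters (e.g. fewer bits per FFT point).
        return worst < threshold;
    }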
Algorithms used in y-cruncher v0.6.1: (Thresholds are approximate and vary between the different tunings)
- The threshold for Hybrid NTT on x86 is actually a lot higher than 4,800,000. But FFTs consume a lot of memory. So the threshold is capped at 4,800,000 to keep memory usage from blowing up. Yes this hurts performance, but it's better than using too much memory.
- The Hybrid NTT and VST algorithms are both unpublished. Their exact run-time complexities are unknown but are both close to O(N log(N)).
- The VST algorithm (using IEEE double-precision) fails above 90 trillion digits. This is the limit of y-cruncher v0.6.x.
Algorithms not used by y-cruncher:
Several popular algorithms that are used in other bignum programs are not used by y-cruncher.
- Toom-3 and higher: Never optimal. Floating-point FFT completely dominates the entire range.
- Schönhage-Strassen: This is the flagship algorithm of GMP. However, it is not used in y-cruncher because the Hybrid NTT beats it on the large sizes, while Floating-point FFT utterly destroys it on the small end.
- NTT over small primes (NTT over multiple primes): ~ O(n log(n))
This is the large-size algorithm used in many Pi programs including: PiFast, QuickPi, TachusPi, etc...
y-cruncher doesn't use it because the Hybrid NTT beats it for small sizes and the VST algorithm (most likely*) beats it on the large sizes.
*Currently, I don't have a working optimized implementation of NTT over primes to confirm that it is actually slower than VST. However, estimates based on operation counts and micro-benchmarks are favoring VST - and becoming more so as current processors are trending towards increased SIMD floating-point throughput.
NTT over primes is a thorny algorithm to implement because it needs significant amounts of micro-optimization and platform-specific code to be efficient.
Nevertheless it's a very popular algorithm and I'm not sure how everyone else handles it. A basic implementation is easy, but there's a lot to gain by going all the way down to inline-assembly - which Visual C++ does not support on x64...
In any case, 64-bit NTT over primes will be necessary if y-cruncher is to go above 90 trillion digits - as even the VST algorithm fails above that size.
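To show why the inner loop is so sensitive to low-level support, here is a rough sketch of its core primitive - 64-bit modular multiplication. (The prime is just a common NTT-friendly example; nothing here is y-cruncher code.)

    #include <cstdint>

    //  The inner loops of an NTT over 64-bit primes are dominated by modular
    //  multiplication, which needs a 128-bit intermediate product.
    //  GCC/Clang expose __int128; Visual C++ on x64 would instead need
    //  intrinsics such as _umul128() plus a hand-written reduction, since it
    //  has neither a 128-bit integer type nor x64 inline assembly.
    static const std::uint64_t PRIME = 0xFFFFFFFF00000001ULL;   //  2^64 - 2^32 + 1

    inline std::uint64_t mulmod(std::uint64_t a, std::uint64_t b){
        unsigned __int128 product = (unsigned __int128)a * b;
        return (std::uint64_t)(product % PRIME);
    }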
Inverse and Roots:
Reciprocals, division, and square roots are all done using 1st order Newton's Method.
- Prior to v0.6.1, their implementations were very basic and unoptimized.
- Starting from v0.6.1, they take advantage of middle-product and FFT caching (reuse of redundant FFT transforms).
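As a minimal sketch of the iteration itself (shown at machine precision only - the arbitrary-precision version roughly doubles the working precision each pass, which is where the middle-product and FFT-caching optimizations apply):

    //  1st order Newton's Method for the reciprocal of a:  r <- r + r*(1 - a*r).
    //  Machine-precision illustration only; not y-cruncher code.
    double reciprocal(double a){
        double r = 1.0f / (float)a;            //  low-precision starting guess
        for (int i = 0; i < 3; i++){
            r = r + r * (1.0 - a * r);         //  each pass ~doubles the correct digits
        }
        return r;
    }

Division and square roots follow the same pattern, e.g. by multiplying by the reciprocal or by iterating on the inverse square root.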
Logarithms:
Prior to v0.6.1, only log(2) and log(10) were supported using hard-coded machin-like formulas. A generic log(n) was needed for Ramanujan's formula for Catalan's constant. That was implemented using the Arithmetic-Geometric Mean (AGM).
In v0.6.1, Ramanujan's formula for Catalan's constant was removed - thereby removing the need for a generic log(n). Instead, v0.6.1 supports computation of log(n) for any small integer n. This is done using a formula generator that generates (at run-time) machin-like formulas for arbitrary small integer n.
Generation of machin-like formulas for log(n) is done using table-lookup along with a branch-and-bound search on several argument reduction formulas.
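As one example of the kind of argument-reduction identity such a search can exploit (not necessarily the exact set of reductions y-cruncher's generator uses): for an integer p,

    p^2 = (p-1)(p+1)\cdot\frac{p^2}{p^2-1}
    \quad\Longrightarrow\quad
    \log p = \tfrac{1}{2}\Bigl[\log(p-1) + \log(p+1)\Bigr]
           + \operatorname{arctanh}\!\left(\frac{1}{2p^2-1}\right)

The arctanh term converges quickly because its argument is tiny, and the logs of the smaller integers p-1 and p+1 can be reduced further or taken from a table - which is exactly the shape of a machin-like formula for log(n).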
Radix Conversion:
- Prior to v0.5.3, conversions were done using the standard Divide and Conquer algorithm.
- Starting from v0.5.3, they are done using Scaled Remainder Trees. This is the same approach that was used in the December 2009 world record for Pi.
- Starting from v0.6.1, the Scaled Remainder Tree algorithm is combined with the middle-product and FFT caching optimizations.
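For reference, the classic Divide and Conquer conversion mentioned above has the following recursive shape (illustrated here with unsigned __int128 standing in for a big integer; the real arbitrary-precision version - and the Scaled Remainder Tree that replaced it - are more involved):

    #include <string>

    //  Divide-and-conquer binary-to-decimal conversion: one large division by
    //  a power of ten splits the number into two independent halves.
    //  With FFT-based division this gives the O(n log(n)^2) conversion cost
    //  mentioned earlier. Illustration only - not y-cruncher code.
    std::string to_decimal(unsigned __int128 x, int digits){
        if (digits <= 18){                      //  small enough for hardware arithmetic
            std::string s(digits, '0');
            for (int i = digits - 1; i >= 0; i--){
                s[i] = (char)('0' + (int)(x % 10));
                x /= 10;
            }
            return s;
        }
        int high = digits / 2;
        int low  = digits - high;
        unsigned __int128 pow10 = 1;
        for (int i = 0; i < low; i++)
            pow10 *= 10;
        //  One big division, then recurse on both halves independently.
        return to_decimal(x / pow10, high) + to_decimal(x % pow10, low);
    }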
Series Summation:
Series summation is done using standard Binary Splitting techniques with the following catches:
- Precision is tightly controlled to keep operands from becoming unnecessarily large.
- The cutting points for the binary splitting recursions are usually significantly skewed in order to reduce memory usage and improve load-balance.
- Series summation is partially done backwards. The lowest terms (the terms with the largest index) are summed first. The motivation and details for this are quite intricate and go hand-in-hand with the skewing. So I won't go into that.
This series summation scheme (including the skewed splitting and backwards summing) has been the same in all versions of y-cruncher to date. All of this is expected to change when GCD factorization is incorporated.
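For readers unfamiliar with the technique, a bare-bones Binary Splitting example (summing e = 1 + sum of 1/k! with 64-bit integers, so only a handful of terms) looks like this. None of y-cruncher's precision control, skewed cut points, or backwards summation is reflected here:

    #include <cstdint>
    #include <cstdio>

    //  P(a,b)/Q(a,b) is the exact sum of a!/k! for k in (a, b]; with a = 0 this
    //  is the partial sum of 1/k!. Keeping the sum as an exact fraction means
    //  only one big division is needed at the very end.
    struct PQ { std::uint64_t p, q; };

    static PQ binary_split(std::uint64_t a, std::uint64_t b){
        if (b - a == 1)
            return { 1, b };                        //  single term: 1/b
        std::uint64_t m = (a + b) / 2;              //  plain midpoint (no skewing)
        PQ L = binary_split(a, m);
        PQ R = binary_split(m, b);
        return { L.p * R.q + R.p, L.q * R.q };      //  combine the two halves
    }

    int main(){
        PQ s = binary_split(0, 20);                 //  20 terms - 20! still fits in 64 bits
        std::printf("e ~ %.15f\n", 1.0 + (double)s.p / (double)s.q);
        return 0;
    }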