y-cruncher - Language and Algorithms
By Alexander J. Yee
(Last updated:
April 4, 2010)
Back To:
Language:
- y-cruncher is written in C and C++ with Intel SSE intrinsics and WinAPI threading and I/O.
- 99% of the program is C and 1% is C++. (All compiled in C++ of course.) No assembly or inline assembly was used.
- Virtually all arithmetic and low-level code is done using in-place functional programming. High-level code is done largely using lightweight objects.
- y-cruncher is compiled using different compilers.
- The versions that use SSE instructions are compiled using the Intel Compiler.
- The legacy x86 version is compiled using Microsoft Visual Studio.
- All versions prior to version 0.4.3 are compiled using Microsoft Visual Studio.
- Note that I have recently received a lot of criticism for using the Intel Compiler because of its purported "unfairness" to non-Intel processors.
Regardess of whether or not this is indeed the case, the Intel Compiler compiles the fastest code among all the compilers that I have tried - even on AMD processors. Therefore, I have no intention of changing compilers for now.
- y-cruncher will remain closed-source. For now at least... So please stop asking me.
- 99% of the source code is portable (standard C/C++)
- y-cruncher supports both WinAPI and OpenMP for multi-threading. The WinAPI implementation goes through wrapper functions to enable easy porting to any library with similar usage and semantics as WinAPI and Pthreads. (Although no Pthread implementation currently exists yet.)
- All code that has been optimized using Visual Studio-specific functions also have portable versions. This allows y-cruncher to be compiled on any standard C++ compiler.
- All functions that use WinAPI go through wrappers. This is the case for all WinAPI thread-control and buffer-controlled I/O.
- Starting from v0.4.4, all new code that use SSE are written in a "vector-generic" manner.
- Allows the code to be easily ported to any vector size.
- No vectorization (via standard C/C++) is still supported as this is simply a vector of size 1.
- Allows the code to be easily ported to any vector architecture that supports the basic functions. (packed addition, subtraction, and multiplication)
As of v0.5.2, the breakdown for lines of code is as follows:
| Category |
Lines of Code |
Bytes of Code |
Active Code |
166,938 |
5,495,464 |
Under Development |
< 10,000 |
|
Deactivated Code* |
173,537 |
7,444,695 |
|
| Category |
Lines of Code |
Bytes of Code |
APIs |
5,032 |
165,645 |
Interface + Minor Features |
3,391 |
109,911 |
Digit Writers + Readers |
4,760 |
149,974 |
Profiling + Tuning |
5,842 |
195,006 |
Testers |
7,300 |
232,967 |
Computation Infrastructure |
3,478 |
118,252 |
Objects |
4,422 |
143,966 |
Functions + Constants |
41,937 |
1,202,895 |
Basic Arithmetic |
1,990 |
62,397 |
Basic Multiplication |
2,644 |
87,271 |
FFT (Original) - x87 + SSE3 + SSE4.1 |
19,788 |
725,617 |
FFT v2 (New) - x64 Vector Scalable |
24,900 |
847,367 |
Hybrid NTT |
35,334 |
1,202,216 |
Multiplication Wrappers |
837 |
30,535 |
Other |
5,283 |
221,445 |
Total (All Active Code) |
166,938 |
5,495,464 |
|
| Functions + Constants |
Lines of Code |
Bytes of Code |
Radix Conversion |
2,704 |
80,625 |
Arithmetic-Geometric Mean |
157 |
4,616 |
Division/Reciprocation |
450 |
11,464 |
Inverse Square Root |
545 |
13,054 |
Integer ArcCoth() |
3,092 |
84,838 |
Square Root of n |
374 |
11,537 |
Golden Ratio |
369 |
12,617 |
e |
4,426 |
112,888 |
Pi |
6,842 |
198,465 |
Log(2) |
1,275 |
37,910 |
Log(10) |
1,262 |
37,599 |
Zeta(3) - Apery's Constant |
6,197 |
174,191 |
Catalan's Constant |
6,082 |
176,129 |
Euler-Mascheroni Constant |
7,211 |
212,686 |
Other |
951 |
34,276 |
Total (Functions + Constants) |
41,937 |
1,202,895 |
|
- # of lines of code can vary greatly depending on formatting conventions and programming styles. Therefore, I've included the total amount of code (in bytes).
This breaks down to about 33 bytes per line of code - with an average indentation of about 4 - 8 spaces.
- Of the 166,938 lines of active code, roughly 11,000 lines were written via scripting. (using a program to write source code)
The mini-programs that were used to generate these 11,000 lines of code have also been included into the total line count. (not in any particular category)
- Deactivated Code includes all code that is no longer used or compiled since the project began in January 2008. This includes all prototyping code that was never intended to be used in a publicly released version of y-cruncher. Most of this code will no longer compile as they are no longer updated with the rest of the program.
Formulas:
y-cruncher has two algorithms for each major constant that it can compute - one for computation, and one for verification.
All complexities shown assume multiplication to be O(n log(n)). It is slightly higher than that, but for all practical purposes, O(n log(n)) close enough.
- Square Root of n and Golden Ratio
- 1st order Newton's Method - Runtime Complexity: O(n log(n))
Note that the final radix conversion from binary to decimal has a complexity of O(n log(n)2).
- e
- Taylor Series of exp(1): O(n log(n)2)

- Taylor Series of exp(-1): O(n log(n)2)

-
- Pi
- Chudnovsky Formula: O(n log(n)3)

- Ramanujan's Formula: O(n log(n)3)

-
- Log(2)
- Machin-Like Formulas: O(n log(n)3)
/log(2)_machin1.jpg)
/log(2)_machin2.jpg)
/log(2)_machin3.jpg)
Despite the fact that the formulas are so similar, they are sufficient to qualify as computation and verification.
As long as two machin-like formula for the same constant have different coefficients for all shared terms, they are sufficent for computation and verification.
-
- Log(10)
- Machin-Like Formulas: O(n log(n)3)
/log(10)_machin1.jpg)
/log(10)_machin2.jpg)
- Zeta(3) - Apery's Constant
- Amdeberhan-Zeilberger Formula 2: O(n log(n)3)
/zeta(3)_amdeberhan-zeiberger-2.jpg)
/zeta(3)_amdeberhan-zeiberger-2-P.jpg)
- Amdeberhan-Zeilberger Formula 1: O(n log(n)3)
/zeta(3)_amdeberhan-zeiberger-1.jpg)
- Catalan's Constant
- Lupas Formula: O(n log(n)3)
- Ramanujan's Formula: O(n log(n)3)

- Euler-Mascheroni Constant
- Brent-McMillan with Refinement: O(n log(n)3)


- Brent-McMillan (alone): O(n log(n)3)


- Note that both of these formulas are essentially the same.
Therefore, in order for two computations to be independent enough to qualify as a verified record, they MUST be done using different n.
For simplicity of implementation, y-cruncher only uses n that are powers of two - which serves a dual purpose of allowing the use of (the easily computed) Log(2), as well as lending itself to shift optimizations.
Arithmetic Algorithms:
Multiplication:
y-cruncher uses a combination of grade-school, Karatsuba, Fast Fourier Transform, and Hybridized Number-Theoretic Transform multiplication algorithms.
This rather unique combination of integer and floating-point based algorithms has proven itself to be extremely useful. Not only is it fast and memory efficient, but it is also well suited for parallelism and vectorization.
Even without the use of assembly language optimizations, the current multiplication scheme used in y-cruncher is faster than that of GMP (as of the version used in Mathematica 6).
(GMP is an extremely fast arbitrary precison arithmetic library that is used by virtually every major math program including Mathematica, and Matlab. It is fast due in part to its use of highly optimized and specialized assembly code for different processors.)
The table below summarizes the approximate ranges where each algorithm is used: (for all versions prior to v0.4.3)
(v0.4.3 has a completely different tuning and is a lot more complicated than this.)
| Algorithm |
Complexity |
Approximate Range (decimal digits) |
| x86 |
x64 |
| Basecase (grade-school multiplication) |
 |
1 - 130 |
1 - 170 |
| Karatsuba |
 |
130 - 1000 |
170 - 560 |
| 3-Way Toom-Cook |
 |
Implemented - But Never Optimal to Use |
| Floating-Point Fast Fourier Transform (FFT) |

w = # bits in mantissa |
1000 - 4,800,000 |
560 - 120,000 |
| Hybridized Number-Theoretic Transform (Hybrid NTT) |
Better than FFT, but worse than SSA
Pending Further Analysis... |
> 4,800,000* |
> 120,000 |
| Schönhage-Strassen Algorithm (SSA) |
 |
Implemented - But Never Optimal to Use** |
The current method of having a single threshold between each algorithm is not optimal as it does not take into account for multiple crossover points between adjacent algorithms.
*From my experience, it is never optimal to use Hybrid NTT on x86. FFT is faster for anything under the 32-bit limit.
But in the interest of reducing memory usage for large computations, I've forced y-cruncher to use Hybrid NTT for anything above 4.8 million digits - even if it means significantly slowing down a computation.
**The difference between the run-time complexities of Hybrid NTT and SSA is on the order of double-logarithms. However, SSA has a much larger Big-O constant - so large that Hybrid NTT is a lot faster for any attainable size despite having a slightly inferior complexity.
(Of course, this could change if there are any new ground-breaking optimizations that apply to SSA but not to Hybrid NTT.)
Other:
Nothing special here. These are all standard algorithms and approaches.
- Division: Division is done using first-order Newton's Method.
- Square Roots/Nth roots: All are done using first-order Newton's Method.
- Natural Logarithm: Logs that don't have a hard-coded Machin-like formula are done using Arithmetic-Geometric Mean (AGM).
- Radix Conversion: Prior to v0.5.3, radix conversions to base 10 were done using the standard Divide and Conquer algorithm.
Starting from v0.5.3, they are done using Scaled Remainder Trees. This is the same approach that was used in the December 2009 world record for Pi.
- Series Summing: Series summing is done using standard Binary Splitting techniques with precision control to keep memory usage bounded.
As far as optimizations go, Division and Square Roots have been given very little attention and are therefore inefficient. (The current implementations aren't even fully paralleled.) Natural Log via AGM suffers as a result. Division and square roots account for such a tiny portion of the total run-time for the majority of constants that it has not been worth the effort to improve it. Most of the optimizations (as of version 0.5.3) have been concentrated in multiplication and Binary Splitting because they clearly dominate total run-time.