y-cruncher - Language and Algorithms

By Alexander J. Yee

 

(Last updated: August 6, 2011)

 

 

Back To:

Language:

 

Code-Size: Lines of Code in y-cruncher:

Although y-cruncher is completely close-sourced (and will likely remain that way for a long time), I'm fine with revealing some of the structure to satisfy the curious.

 

Below is the approximate code-size distribution of the y-cruncher project (July 2011).

Some of the numbers are more up-to-date than others. But for the most part, they are relatively accurate.

Category Lines of Code Bytes of Code Files Comments
 
General        
Interface + Minor Features
3,497
110,905
6
Bench/Batch/Stress/Custom Compute
Computation Infrastructure
2,894
97,126
4
Internal interface for launching computations.
Digit Conversion
6,912
245,588
38
Digit format conversions + compression/decompression
Digit Readers/Writers
7,569
236,013
27
Digit Viewer + Internal Digit Writer
Architecture-Specific Profiles
13,687
438,335
25

Settings for: x86, x86 SSE3, x64 SSE3 (Kasumi), x64 SSE4.1 (Nagisa/Ushio), x64 AVX (Hina)

About 11,000 lines of this code is generated by the superoptimizer.

APIs
8,019
270,650
44
Home-made Libaries: Memory Allocation, Threads, Disk I/O, Time, Strings, etc...
Internal Object Interface
5,182
170,883
22
Objects for: Large Integers, Floating-Point
 
Functions        
Division/Reciprocation
594
17,110
1
 
Integer ArcCoth()
3,174
87,369
9
Used by Log(2), Log(10), and the Euler-Mascheroni Constant
Radix Conversion
1,977
61,607
7
 
Other Minor Functions
253
8,624
3
AGM for Natural Log, Integer Powers, etc...
 
Constants        
Square Root + Golden Ratio
1,427
42,623
7
Includes generic square roots.
e
4,642
121,095
10
exp(1) and exp(-1) are each about 1,100 lines. Fused summation is 1,700 lines. Rest is shared.
Pi - Chudnovsky
3,362
98,342
8
Uses square roots.
Pi - Ramanujan
3,153
91,090
8
Uses square roots.
Pi - (shared)
725
22,550
4
 
Log(2)
1,366
41,743
5
Uses Integer ArcCoth().
Log(10)
1,359
41,661
5
Uses Integer ArcCoth().
Zeta(3) - AZ1
2,851
79,610
8
 
Zeta(3) - AZ2
2,923
82,472
8
 
Zeta(3) - (shared)
725
22,839
4
 
Catalan - Lupas
3,127
88,531
10
 
Catalan - Ramanujan
2,752
80,567
10
No Advanced Swap Implementation yet. Uses Pi, AGM natural Log, and square roots.
Catalan - (shared)
439
14,994
3
 
Euler-Mascheroni Constant
7,247
214,528
25
No Advanced Swap Implementation yet. Both algorithms use the same code. Uses Log(2).
Other
3,276
150,197
12
 
 
Internal Modules        
Basic Arithmetic Unit
1,996
65,007
14
Addition, Subtraction, Shift, Checksum-Hashing, Random Number Generation, etc...
Basic Multiplication Unit
3,551
119,519
4
32-bit implementations of: N x 1 Multiply, Basecase, Karatsuba, and Toom-3.
FFT (original)
16,347
611,337
65
Original FFT from the first release. Generic, SSE3, and SSE4.1. (depreciated)
FFT (v2)
28,498
961,047
75
Vector-Scalable FFT: SSE3, SSE4.1, AVX, and FMA (planned).
FFT (shared)
2,968
101,042
7
Shared by both FFTs.
Hybrid NTT (core only)
22,194
790,463
57

Core components only. Total code-size for this algorithm is about 35,000 lines.

An additional 13,000 lines belonging to this algorithm are scattered across other categories.

Multiplication Wrappers
6,198
206,845
12

Builds a multiplication interface around the FFT/FFTv2 and the Hybrid NTT.

All three algorithms (Basic, FFT, and Hybrid NTT) are wrapped together by an algorithm dispatcher.

Other
2,791
121,163
14
 
 
Total Active Code:
177,675
5,913,475
561
Includes all code that is currently active in the released version of y-cruncher v0.5.5.
 
 
Development Code
(Disabled in Public Releases)
       
Auther's Version Features
6,538
220,070
8
Benchmarking features for each individual multiplication algorithm.
Performance Tuning
5,174
165,853
14
Includes the internal superoptimizer.
Testers/Debuggers
2,529
77,959
12
 
Total Development Code:
14,241
463,882
34
 
 
Unfinished Components        
YMP Front-End Library
10,456
358,459
44

100% complete (Alpha Testing) - ETA: sometime after v0.6.1

Dynamic-Link Library for y-cruncher's arithmetic library.

All essential features have been implemented. All new features are extraneous.

y-cruncher Modules Wrapper
5,748
228,044
26

100% complete - ETA: v0.6.1

Compile-time code dispatcher for all of y-cruncher's low-level functions.

Basic Arithmetic Unit (v2)
14,546
483,045
78

100% complete - ETA: v0.6.1

Native 64-bit/vector arithmetic: Shift, N x 1/Basecase/Karatsuba Multiply

Partial support for Delayed-Carry Arithmetic.

Multiplication Wrappers (v2)
6,508
212,278
26

100% complete - ETA: v0.6.1

Native 64-bit. Flexible memory management. Double-ended Partial products.

Internal Data Conversion
4,757
169,304
42

100% complete - ETA: v0.6.1

A dedicated unit for converting between different internal representations.

This module is currently only used by the YMP Front-End Library Interface and Hybrid NTT (v2).

Neither of these are in production yet. So there's no point in enabling this module yet.

Multi-Layer Raid-File
4,368
139,828
21

70% complete - ETA: v0.6.1? - Usable. Recovery features not implemented yet.

A new swap-file interface that supports hybrid RAID 0/3.

VST Multiplication Algorithm
57,708
1,901,957
183

95% complete - ETA: v0.6.1? - Complete. Not integrated yet. Undergoing intensive testing.

Another experimental home-made algorithm. Faster than Hybrid NTT above 50 billion digits.

More extensions are planned for the future.

Integer NTT (x64)
720
23,438
5
Not Started - Still in experimental phase.

Classic Number-Theoretic Transform over multiple primes.

Total Unfinished Code:
104,811
3,516,353
425
 
 
 
Canceled/On Hold        
Hybrid NTT (v2)
41,667
1,469,808
119

30% complete - On Hold Indefinitely. (The newly completed VST Algorithm is good enough.)

Hybrid NTT version 2

Total Canceled/On Hold Code:
41,667
1,469,808
119
 
         
         
Grand Total:
316,622
10,680,436
1,081
 

Notes:

The implementation for each of these is about 3000 lines - including multi-threading and swap. For each algorithm listed here, these 3000 lines are mostly copy/paste duplicates of each other with the necessary changes made for each algorithm.

 

Formulas:

y-cruncher has two algorithms for each major constant that it can compute - one for computation, and one for verification.

 

All complexities shown assume multiplication to be O(n log(n)). It is slightly higher than that, but for all practical purposes, O(n log(n)) is close enough.

 

Square Root of n and Golden Ratio

1st order Newton's Method - Runtime Complexity: O(n log(n))
Note that the final radix conversion from binary to decimal has a complexity of O(n log(n)2).
e

Taylor Series of exp(1): O(n log(n)2)


Taylor Series of exp(-1): O(n log(n)2)


 
Pi

Chudnovsky Formula: O(n log(n)3)


Ramanujan's Formula: O(n log(n)3)


 
Log(2)

Machin-Like Formulas: O(n log(n)3)




Despite the fact that the formulas are so similar, they are sufficient to qualify as computation and verification.
As long as two machin-like formula for the same constant have different coefficients for all shared terms, they are sufficent for computation and verification.

 
Log(10)

Machin-Like Formulas: O(n log(n)3)



 
Zeta(3) - Apery's Constant

Amdeberhan-Zeilberger Formula 2: O(n log(n)3)



Amdeberhan-Zeilberger Formula 1: O(n log(n)3)



Catalan's Constant

Lupas Formula: O(n log(n)3)


Ramanujan's Formula: O(n log(n)3)



Euler-Mascheroni Constant

Brent-McMillan with Refinement: O(n log(n)3)



                          


Brent-McMillan (alone): O(n log(n)3)



                 

Note that both of these formulas are essentially the same.
Therefore, in order for two computations to be independent enough to qualify as a verified record, they MUST be done using different n.

For simplicity of implementation, y-cruncher only uses n that are powers of two - which serves a dual purpose of allowing the use of (the easily computed) Log(2), as well as lending itself to shift optimizations.

Arithmetic Algorithms:

 

Multiplication:

y-cruncher uses a combination of grade-school, Karatsuba, Fast Fourier Transform, and Hybridized Number-Theoretic Transform multiplication algorithms.

This rather unique combination of integer and floating-point based algorithms has proven itself to be extremely useful. Not only is it fast and memory efficient, but it is also well suited for parallelism and vectorization.


Even without the use of assembly language optimizations, the current multiplication scheme used in y-cruncher is faster than that of GMP (as of the version used in Mathematica 6).

(GMP is an extremely fast arbitrary precison arithmetic library that is used by virtually every major math program including Mathematica, and Matlab. It is fast due in part to its use of highly optimized and specialized assembly code for different processors.)


The table below summarizes the approximate ranges where each algorithm is used: (for all versions prior to v0.4.3)

(v0.4.3 has a completely different tuning and is a lot more complicated than this.)

Algorithm Complexity Approximate Range (decimal digits)
x86 x64
Basecase (grade-school multiplication) 1 - 130 1 - 170
Karatsuba 130 - 1000 170 - 560
3-Way Toom-Cook Implemented - But Never Optimal to Use
Floating-Point Fast Fourier Transform (FFT)
w = # bits in mantissa
1000 - 4,800,000 560 - 120,000
Hybridized Number-Theoretic Transform (Hybrid NTT) Better than FFT, but worse than SSA
Pending Further Analysis...
> 4,800,000* > 120,000
Schönhage-Strassen Algorithm (SSA) Implemented - But Never Optimal to Use**

The current method of having a single threshold between each algorithm is not optimal as it does not take into account for multiple crossover points between adjacent algorithms.

 

*From my experience, it is never optimal to use Hybrid NTT on x86. FFT is faster for anything under the 32-bit limit.
But in the interest of reducing memory usage for large computations, I've forced y-cruncher to use Hybrid NTT for anything above 4.8 million digits - even if it means significantly slowing down a computation.

 

**The difference between the run-time complexities of Hybrid NTT and SSA is on the order of double-logarithms. However, SSA has a much larger Big-O constant - so large that Hybrid NTT is a lot faster for any attainable size despite having a slightly inferior complexity.
(Of course, this could change if there are any new ground-breaking optimizations that apply to SSA but not to Hybrid NTT.)

 

 

Other:

Nothing special here. These are all standard algorithms and approaches.

As far as optimizations go, Division and Square Roots have been given very little attention and are therefore inefficient. (The current implementations aren't even fully paralleled.) Natural Log via AGM suffers as a result. Division and square roots account for such a tiny portion of the total run-time for the majority of constants that it has not been worth the effort to improve it. Most of the optimizations (as of version 0.5.3) have been concentrated in multiplication and Binary Splitting because they clearly dominate total run-time.