(Last updated: July 29, 2014)
Implementation (as of v0.6.5):
General Information:
Libraries and Dependencies:
Compilers:
Other Internal Requirements:
Overall Design:
y-cruncher's design is split into three major layers. As of v0.6.5:
Instruction Set Management:
y-cruncher has a compile-time framework for selecting and dispatching all of the different instruction set extensions that it uses. This makes it easy to adopt new instruction sets without breaking backwards compatibility for older processors.
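The dispatch idea can be sketched as follows. This is a hypothetical illustration (y-cruncher's actual framework is compile-time and internal to its C++ source): each binary declares the features it requires, and a launcher picks the most specialized binary whose requirements are a subset of what the CPU reports. The function name and feature sets here are made up for the example; only the binary names come from the table below.

```python
# Hypothetical sketch of binary selection by instruction-set features.
# y-cruncher's real framework is compile-time C++; this only shows the idea.

BINARIES = [
    # (name, required features) - ordered from most to least specialized
    ("x64 AVX2 ~ Airi",   {"x64", "avx", "fma3", "avx2", "bmi2"}),
    ("x64 AVX ~ Hina",    {"x64", "avx"}),
    ("x64 SSE4.1 ~ Ushio", {"x64", "sse4.1"}),
    ("x86", set()),
]

def select_binary(cpu_features):
    """Return the first (most specialized) binary the CPU can run."""
    for name, required in BINARIES:
        if required <= cpu_features:  # subset test
            return name
    raise RuntimeError("no compatible binary")

print(select_binary({"x64", "sse4.1", "avx"}))  # -> x64 AVX ~ Hina
```

A real dispatcher would populate `cpu_features` from CPUID, but the subset test is the essence of it.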
As of v0.6.5, y-cruncher uses the following instruction sets:
Binary               | Target Processor(s)  | Instruction Set Extensions
x64 AVX2 ~ Airi      | Intel Haswell        | x64, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, ABM, FMA3, AVX2, BMI2
x64 XOP ~ Miyu       | AMD Bulldozer        | x64, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, ABM, FMA4, XOP
x64 AVX ~ Hina       | Intel Sandy Bridge   | x64, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX
x64 SSE4.1 ~ Ushio   | Intel Nehalem        | x64, SSE, SSE2, SSE3, SSSE3, SSE4.1
x64 SSE4.1 ~ Nagisa  | Intel Core 2 Penryn  | x64, SSE, SSE2, SSE3, SSSE3, SSE4.1
x64 SSE3 ~ Kasumi    | AMD K10              | x64, SSE, SSE2, SSE3
x86 SSE3             |                      | SSE, SSE2, SSE3
x86                  |                      | (none)
The extensions in this table are explicitly used in the source code, either via compiler intrinsics or inline assembly. However, the compiler may also generate instructions that are not listed here but are implied by enabling a higher instruction set (e.g. SSE implies MMX, and BMI2 implies BMI1).
y-cruncher has two algorithms for each major constant that it can compute: one for the computation and one for the verification.
All complexities shown assume that multiplication is O(n log(n)). It is actually slightly higher than that, but for all practical purposes, O(n log(n)) is close enough.
Both of these formulas were pulled from: http://numbers.computation.free.fr/Constants/PiProgram/userconstants.html
Note that the AGM-based algorithms are probably faster. But y-cruncher currently uses these series-based formulas because:
Expanded Articles:
Multiplication:
Like most bignum programs, y-cruncher uses different multiplication algorithms for different sizes:
As of v0.6.3, y-cruncher uses:
Standard libraries like GMP avoid the use of floating-point FFTs due to the difficulty of proving bounds on round-off error. But with modern SIMD, floating-point performance is through the roof. So rather than throwing all of this computational power away, y-cruncher utilizes it to the maximum extent.
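To make the round-off concern concrete, here is a minimal floating-point FFT multiply. This is a generic textbook sketch, not y-cruncher's implementation: the digits of each operand are convolved with a complex FFT, and the final rounding step is exactly where unproven error bounds would bite.

```python
# Big-integer multiplication via a floating-point FFT (illustrative sketch).
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a
    even = fft(a[0::2], invert)
    odd  = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def multiply(x, y, base=10000):
    """Multiply non-negative integers by convolving their base-10^4 digits."""
    def digits(v):
        d = []
        while v:
            d.append(v % base)
            v //= base
        return d or [0]
    a, b = digits(x), digits(y)
    n = 1
    while n < len(a) + len(b):
        n *= 2
    fa = fft([complex(v) for v in a] + [0] * (n - len(a)))
    fb = fft([complex(v) for v in b] + [0] * (n - len(b)))
    prod = fft([u * v for u, v in zip(fa, fb)], invert=True)
    # Rounding back to integers: correctness depends on the accumulated
    # round-off error staying below 0.5, which is hard to prove in general.
    coeffs = [round(c.real / n) for c in prod]
    return sum(c * base ** i for i, c in enumerate(coeffs))
```

Conservative parameter selection in this setting means keeping the digit size (here `base`) small enough relative to the transform length that the rounding cannot plausibly go wrong.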
The FFT, Hybrid NTT, and VST algorithms all use floating-point in a way that (due to round-off error) has not been proven to be mathematically correct.
But y-cruncher ensures their correctness by means of conservative parameter selection and algorithmic error-detection. Barring any programming bugs, the probability that y-cruncher's multiplication makes an undetected error is less than the probability of a silent hardware error. So from an engineering standpoint, y-cruncher is actually more reliable than other implementations that lack algorithmic error-detection.
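One standard form of algorithmic error-detection (y-cruncher's exact checks are internal, so this is only an illustration of the general technique) is to verify a product modulo small primes chosen independently of the multiply itself. A corrupted product survives each check with probability roughly 1/p.

```python
# Illustration of algorithmic error-detection for multiplication:
# check the result modulo small primes. Not y-cruncher's actual code.
import random

CHECK_PRIMES = [1000003, 998244353, 2147483647]  # arbitrary known primes

def checked_multiply(a, b, trials=3):
    product = a * b  # stand-in for an unproven FFT-based multiply
    for _ in range(trials):
        p = random.choice(CHECK_PRIMES)
        # The residues of a and b are cheap to compute relative to the
        # multiply itself, so the check adds little overhead.
        if (a % p) * (b % p) % p != product % p:
            raise ArithmeticError("multiplication failed modular check")
    return product
```

The point is engineering-level assurance: an error that slips past several independent modular checks is less likely than a silent hardware fault.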
The last two of these algorithms (Hybrid NTT and VST) are homebrew algorithms that were developed between 2008 and 2012. Both remain unpublished.
y-cruncher currently does not use any of the following algorithms:
Toom-Cook is completely outclassed by the floating-point FFT. SSA doesn't stand much of a chance either, as long as the FFT fits in CPU cache.
Outside of the processor cache, y-cruncher's current large-multiply code seems to be better. Furthermore, neither Toom-Cook nor SSA is particularly vectorizable.
In the past, the NTT over small primes algorithm was dismissed as infeasible. But recent advances in the multiply-mod algorithm, along with better hardware support for the 64-bit integer multiply, are making things more interesting. Nevertheless, preliminary findings on a Haswell processor suggest that it is unlikely to be faster than what y-cruncher currently has.
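For reference, the small-primes approach replaces the floating-point transform with an exact convolution over Z/pZ. The sketch below uses the well-known NTT prime p = 998244353 (3·2^23 + 1, primitive root 3); Python's big integers hide the multiply-mod cost that makes this hard to do fast on real hardware.

```python
# Exact big-integer multiplication via a number-theoretic transform (NTT)
# over a single small prime. Illustrative only; a production version would
# use several 64-bit primes and CRT, with a fast multiply-mod.
P, G = 998244353, 3  # P = 3 * 2^23 + 1, G is a primitive root mod P

def ntt(a, invert=False):
    n = len(a)
    if n == 1:
        return a
    even = ntt(a[0::2], invert)
    odd  = ntt(a[1::2], invert)
    root = pow(G, (P - 1) // n, P)      # principal n-th root of unity
    if invert:
        root = pow(root, P - 2, P)      # inverse root for the inverse NTT
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
        w = w * root % P
    return out

def ntt_multiply(x, y, base=10):
    def digits(v):
        d = []
        while v:
            d.append(v % base)
            v //= base
        return d or [0]
    a, b = digits(x), digits(y)
    n = 1
    while n < len(a) + len(b):
        n *= 2
    fa = ntt(a + [0] * (n - len(a)))
    fb = ntt(b + [0] * (n - len(b)))
    coeffs = ntt([u * v % P for u, v in zip(fa, fb)], invert=True)
    inv_n = pow(n, P - 2, P)            # undo the missing 1/n scaling
    return sum(c * inv_n % P * base ** i for i, c in enumerate(coeffs))
```

Unlike the floating-point FFT, every step here is exact, which is why GMP-style libraries favor it despite the slower modular arithmetic.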
Log(n):
Prior to v0.6.1, only log(2) and log(10) were supported, using hard-coded Machin-like formulas. A generic log(n) was needed for Ramanujan's formula for Catalan's constant. That was implemented using the Arithmetic-Geometric Mean (AGM).
In v0.6.1, Ramanujan's formula for Catalan's constant was removed, thereby removing the need for a generic log(n). Instead, v0.6.1 supports the computation of log(n) for any small integer n. This is done using a formula generator that generates (at run-time) Machin-like formulas for arbitrary small integer n.
Generation of Machin-like formulas for log(n) is done using table lookup along with a branch-and-bound search over several argument-reduction formulas.
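To illustrate what a Machin-like formula for a logarithm looks like (this is standard mathematics, not y-cruncher's generator): since artanh(1/m) = (1/2)·log((m+1)/(m-1)), logarithms of small integers decompose into integer combinations of fast-converging artanh series. For example, log(2) = 2·artanh(1/3) and log(3) = 2·artanh(1/3) + 2·artanh(1/5).

```python
# Evaluating Machin-like formulas for log(n) from artanh series.
from fractions import Fraction
import math

def artanh_inv(m, terms=30):
    """Exact rational partial sum of artanh(1/m) = sum 1/((2k+1)*m^(2k+1))."""
    return sum(Fraction(1, (2 * k + 1) * m ** (2 * k + 1)) for k in range(terms))

# artanh(1/m) = (1/2)*log((m+1)/(m-1)), hence:
#   log(2) = 2*artanh(1/3)                   since (3+1)/(3-1) = 2
#   log(3) = 2*artanh(1/3) + 2*artanh(1/5)   since (5+1)/(5-1) = 3/2
log2 = 2 * artanh_inv(3)
log3 = 2 * artanh_inv(3) + 2 * artanh_inv(5)
print(float(log2), math.log(2))  # agree to double precision
```

The generator's job is to search for such combinations automatically; the series themselves are then summed with binary splitting.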
Series Summation:
Series summation is done using standard Binary Splitting techniques with the following catches:
This series summation scheme (including the skewed splitting and backwards summing) has been the same in all versions of y-cruncher to date. All of this is expected to change when GCD factorization is incorporated.
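For readers unfamiliar with the technique, here is vanilla binary splitting on a small factorial-type series (e = 1 + Σ 1/k!). This omits y-cruncher's skewed splitting and backwards summing; it only shows the basic recursion that keeps all arithmetic on large balanced operands.

```python
# Vanilla binary splitting for e - 1 = sum_{k>=1} 1/k!.
from fractions import Fraction
import math

def binsplit(a, b):
    """Return (P, Q) with P/Q = sum_{k=a}^{b-1} 1/(a*(a+1)*...*k),
    so Q = a*(a+1)*...*(b-1). Splitting keeps operand sizes balanced."""
    if b - a == 1:
        return 1, a                      # single term: 1/a
    m = (a + b) // 2
    p1, q1 = binsplit(a, m)
    p2, q2 = binsplit(m, b)
    # S(a,b) = S(a,m) + (1/Q1) * S(m,b) = (P1*Q2 + P2) / (Q1*Q2)
    return p1 * q2 + p2, q1 * q2

p, q = binsplit(1, 30)                   # sums 1/1! + 1/2! + ... + 1/29!
e = 1 + Fraction(p, q)
print(float(e))                          # matches math.e to double precision
```

The win over naive term-by-term summation is that the expensive multiplications all occur between integers of comparable size, which is exactly where fast multiplication pays off.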
A few years ago, y-cruncher was under pretty heavy development. But I've since left school and found a full-time job.
While y-cruncher will probably remain a side hobby for the near future, it won't get as much attention as it used to.
At the very least, I'll continue to fix whatever bugs are discovered. And I'll make an effort to keep the program up-to-date with the latest instruction set extensions and development toolchains. But for the most part, the project is done.
Nevertheless, it's a never-ending project. So there are things on the to-do list. But it may be a long time before anything gets done.
Feature: AVX2
Description: AVX2 support for Haswell processors.
Notes: Done: v0.6.5

Feature: Command Line Options
Description: Run the program directly from the command line without needing to enter input. All of the major features in y-cruncher should be accessible via the command line.
Notes: By sheer popular demand...

Feature: Reduced Memory Mode
Description: For Pi Chudnovsky and Ramanujan, add a mode that allows a computation to be done using less memory/disk at the cost of slower performance.
For a typical computation, most of the work requires very little memory. It's the occasional memory spike that gives y-cruncher such a high memory requirement. There are about 4 large memory spikes in a Pi computation. In approximate descending order of size, they are:
These spikes can be flattened via space-time tradeoffs in the respective algorithms. Since the tradeoff only needs to be done at the spikes, the overall performance hit should be reasonably small.
This feature is in progress and is currently able to suppress the 1st spike. But since the 2nd spike isn't much smaller than the 1st, the drop in memory usage is less than 10%, making it almost pointless. Nevertheless, it was used in the 12.1 trillion digit computation of Pi.
Notes: The feature is usable. But since it is far from complete, it is currently disabled in the publicly released versions of the program.

Feature: GCD Factorization
Description: Implement the optimization described here. At minimum, do it for Pi Chudnovsky and Ramanujan. See if it can also be done for Zeta(3) and Catalan.
Notes: Because of the way that y-cruncher micromanages its memory, this isn't going to be easy to do. Furthermore, y-cruncher lacks frameworks for integer division and a segmented sieve.

Feature: Small Primes NTT
Description: It seems that y-cruncher is the only serious Pi program that doesn't use this algorithm. But that may change in the future. While a 64-bit small primes NTT isn't expected to be faster than the existing large-multiply algorithms, it is capable of going above 90 trillion digits. At those sizes, the computation will probably be completely I/O-bound anyway.