
**Version 0.8.4 Released:** (February 21, 2024) - permalink

y-cruncher v0.8.4 has been released with most of the improvements in non-Pi related math as well as continued improvements for large computations in swap mode.

For the purpose of competitive benchmarking, v0.8.4 does have changes that theoretically affect performance of the benchmark-relevant computations. But so far I have yet to be able to measure any difference. So until someone proves me wrong, I declare that v0.8.4 benchmarks are comparable to both v0.8.3 and v0.8.2 for both competitive benchmarking and hardware reviews.

__New Math Improvements:__

Jorge Zuniga has done it again! This time with new fastest formulas for Log(2), Log(3), and Log(5). You can grab the formula files here:

- Log(2) - Zuniga (2024-i).cfg
- Log(2) - Zuniga (2024-ii).cfg
- Log(3) - Zuniga (2024-i).cfg
- Log(5) - Zuniga (2024-i).cfg

More information on MathOverflow.

Log(2) is one of the most fundamental constants and is used in many places such as the AGM algorithm for the logarithm and the Euler-Mascheroni Constant. It has been decades since the last time Log(2) has had a new fastest formula, but Jorge has found one that's 50% faster!

y-cruncher v0.8.4 will special-case for these values and will use these formulas instead of the auto-generated ArcCoth() Machin-like formulas.
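For readers who haven't seen one, here is a minimal sketch of what an ArcCoth() Machin-like formula for Log(2) looks like. The particular formula (coefficients 18, -2, 8 with arguments 26, 4801, 8749) is a classical one picked for illustration. It is an assumption that it resembles what y-cruncher auto-generates, and it is not one of Zuniga's new formulas. The helper names are hypothetical.

```python
def acoth_fixed(n, scale):
    """Approximate acoth(n) * scale using integer arithmetic only.

    Uses the series acoth(n) = 1/n + 1/(3 n^3) + 1/(5 n^5) + ...
    """
    power = scale // n          # scale / n^(2k+1), starting at k = 0
    total = power
    n_squared = n * n
    k = 1
    while power:
        power //= n_squared
        total += power // (2 * k + 1)
        k += 1
    return total


def log2_fixed(digits, guard=10):
    """Return Log(2) * 10^digits, truncated; guard digits absorb round-off."""
    scale = 10 ** (digits + guard)
    value = (18 * acoth_fixed(26, scale)
             - 2 * acoth_fixed(4801, scale)
             + 8 * acoth_fixed(8749, scale))
    return value // 10 ** guard


print(log2_fixed(30))
```

Each ArcCoth() term is a simple power series, so the cost is dominated by the term with the smallest argument (26 here). Faster formulas generally win by needing fewer or faster-converging series.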

And on a somewhat related note, the custom formula feature has gone through some rework and now has some new functionality:

- Square(x)
- Reciprocal(x)
- Log2
- Exp(x)
- All hyperbolic trig functions.
- All inverse hyperbolic trig functions.
- Power(x, y) now supports non-integer powers.

Some of these are convenience wrappers of existing functionality while others are entirely new. And as before, all custom formula functionality supports swap mode with checkpointing and can therefore be taken up to billions/trillions of digits if desired.

The main addition is the exponential function exp(x). This has been a notable omission given that log(x) has been supported since the first release of the custom formula feature. And now that exp(x) has finally been added, it unlocks the non-integer power function as well as the rest of the hyperbolic trigonometric functions.

With that we get a new category of constants, a sample of which includes:

- Ramanujan's Constant.cfg
- Gelfond-Schneider Constant.cfg
- Sqrt(2)^Sqrt(2).cfg
- exp(Pi) - Direct.cfg (was previously accessible using a hypergeometric series)
- Pi^e.cfg
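To illustrate why exp(x) is the piece that unlocks the rest, here is a small sketch using Python's standard `decimal` module. It only demonstrates the identities (x^y = exp(y log x), and the hyperbolic functions in terms of exp). It is not y-cruncher's custom formula syntax, and the function names are made up for this example.

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # working precision in significant digits


def power(x, y):
    """x^y for x > 0 and arbitrary (non-integer) y, via exp(y * log(x))."""
    return (y * x.ln()).exp()


def sinh(x):
    ex = x.exp()
    return (ex - 1 / ex) / 2


def cosh(x):
    ex = x.exp()
    return (ex + 1 / ex) / 2


sqrt2 = Decimal(2).sqrt()
gelfond_schneider = power(Decimal(2), sqrt2)   # 2^sqrt(2)
sqrt2_pow_sqrt2 = power(sqrt2, sqrt2)          # sqrt(2)^sqrt(2)

print(gelfond_schneider)
print(sqrt2_pow_sqrt2)
```

Once exp(x) and log(x) both exist, everything above is a thin wrapper, which matches the "convenience wrappers" description.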

I'll note that while y-cruncher *supports* exp(x), it's rather slow to evaluate. Supporting it at all was more important than making it performance optimal. Nevertheless, it is still quasi-linear in run-time, so you can (in theory) compute exp(x) for non-trivial x to trillions of digits. But it will be a longer wait given the massive implied big-O constant.

__Improvements for Large Computations:__

Now we move onto the same old stuff that's been going on since v0.8.2 - improvements relevant to very large computations.

Three main things have changed here:

- The limit of y-cruncher has been increased from 1 x 10^{15} to 108 x 10^{15} decimal digits.
- Checkpointing is now supported inside the radix conversion.
- The Amdahl's Law and AMD superalignment shenanigans have been fixed.

**Increasing the Limit from 1 x 10^{15} digits to 108 x 10^{15} digits:**

Historically, y-cruncher has been capped to 1 x 10^{15} digits. And with the Pi record currently sitting at 100 trillion digits (1 x 10^{14}), it's not hard to imagine the limit being reached in the near future - possibly within a few years.

The 1 x 10^{15} digits comes from what I call the "float-indexing limit". It is the use of double-precision for program "flow control parameters". For example:

- If I want to compute N digits of Pi, I need to know how many terms of the series to sum up. To determine that, I need to divide N by the convergence rate of the series. Since the convergence rate is not an integer, this operation uses floating-point.
- The series for e has factorials in the denominator. To calculate how much memory it takes to store those factorials, I need to use Stirling's formula for approximating the LogGamma function. Needless to say, this also needs floating-point arithmetic, and lots of it.
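The two examples above can be sketched roughly as follows. This is a simplified illustration in plain double-precision, not y-cruncher's actual code. It assumes the Chudnovsky series for Pi, whose convergence rate is log10(640320^3 / 1728), or about 14.18 digits per term. The function names are hypothetical.

```python
import math

# Digits gained per term of the Chudnovsky series (about 14.18).
CHUDNOVSKY_DIGITS_PER_TERM = math.log10(640320**3 / 1728)


def pi_terms(digits):
    """Number of series terms needed for `digits` digits of Pi."""
    return math.ceil(digits / CHUDNOVSKY_DIGITS_PER_TERM)


def e_terms(digits):
    """Smallest n with n! >= 10^digits, using lgamma(n + 1) = ln(n!)."""
    target = digits * math.log(10)
    lo, hi = 1, 2
    while math.lgamma(hi + 1) < target:   # grow an upper bound
        hi *= 2
    while lo < hi:                        # binary search for the crossover
        mid = (lo + hi) // 2
        if math.lgamma(mid + 1) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


print(pi_terms(10**6), e_terms(10**6))
```

Both results feed directly into loop bounds and buffer sizes, which is why an off-by-one from floating-point round-off matters here.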

Given that double-precision only has around 16 digits of precision, it begins to run out for computations that approach that many digits. And since it's difficult to pin-point exactly when round-off becomes problematic, y-cruncher has been artificially capped at 10^{15} digits as a safe limit.
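A quick way to see the problem is to look at the gap between adjacent doubles at these magnitudes. The snippet below only shows the best-case representable spacing. Accumulated round-off in a multi-step calculation loses further digits on top of this, which is why the safe cap sits lower than the raw numbers might suggest.

```python
import math


def spacing(x):
    """Gap between x and the next representable double."""
    return math.ulp(x)


for digits in (1e14, 1e15, 1.08e17):
    print(f"{digits:.3g} digits: adjacent doubles are {spacing(digits)} apart")

# Above 2^53 (about 9.007e15), not every integer fits in a double.
print(float(2**53) == float(2**53 + 1))
```

At 10^{15} the spacing is already 0.125, and at 108 x 10^{15} it is 16, so an exact digit index can no longer be stored in a double at all.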

As of v0.8.4, all uses of "float-indexing" are now done using a higher-precision data type, thus eliminating this limit of 10^{15} digits. And with one limit removed, the next lowest limit kicks in which (as of this writing) is likely to be 108 x 10^{15} from the limit of the current 64-bit NTT implementation. Going beyond that gets uncertain as *many* things begin to overflow 64-bit integers. Fortunately, that will not be a problem for decades to come.

**Checkpointing the Radix Conversion:**

Moving on... Checkpointing is crucial to large computations because it allows the program to resume a run after being interrupted. Without checkpointing, the recent Pi records wouldn't be achievable as it's virtually impossible to keep a large machine running under stress for so many months without *something* breaking.

For checkpoints to be useful, they must be frequent enough that the computer can easily reach the next checkpoint without failing. If the time between two checkpoints exceeds the MTTF (Mean Time to Failure), it becomes difficult to make forward progress since it'll likely require multiple attempts to cross the gap. For this reason, 90% of y-cruncher does frequent checkpoints.
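To put rough numbers on this, here is a back-of-the-envelope model. It assumes failures are memoryless (exponentially distributed with the given MTTF), which is a simplification I'm introducing for illustration and not a measured property of any real machine.

```python
import math


def success_probability(gap_hours, mttf_hours):
    """Chance of crossing a checkpoint gap without a failure."""
    return math.exp(-gap_hours / mttf_hours)


def expected_attempts(gap_hours, mttf_hours):
    """Mean number of tries until one attempt crosses the gap."""
    return 1.0 / success_probability(gap_hours, mttf_hours)


for gap in (10, 100, 300):
    p = success_probability(gap, mttf_hours=100)
    print(f"gap {gap:>3} h: success {p:.3f}, expected attempts {expected_attempts(gap, 100):.2f}")
```

Under this model the expected number of attempts grows exponentially with the ratio of gap length to MTTF, so a gap equal to the MTTF already costs about 2.7 attempts on average.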

That remaining 10% is the radix conversion. This operation has never supported checkpointing because the algorithm is in-place and destructive. Meaning that it constantly overwrites its own dataset making it difficult to checkpoint because you can't roll back to data that has been overwritten. Over the years, the lack of checkpointing here has become a big enough problem that several of the Pi records have indeed required multiple attempts to get through the conversion. So it has become a liability that is getting worse as the record gets pushed higher and higher.
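For context, the radix conversion is the step that turns the binary result into decimal digits. Below is a textbook divide-and-conquer version in Python. It is only meant to show the shape of the problem (recursively split the number by a power of 10). It allocates fresh values at every level, so it is not the in-place, destructive algorithm described above and not y-cruncher's implementation.

```python
def to_decimal(n, digits):
    """Return non-negative integer n as a zero-padded string of `digits` digits."""
    if digits <= 18:
        # Small enough to convert directly.
        return str(n).zfill(digits)
    low_digits = digits // 2
    # Split n into a high part and a low part around 10^low_digits.
    high, low = divmod(n, 10 ** low_digits)
    return to_decimal(high, digits - low_digits) + to_decimal(low, low_digits)


value = 2 ** 200
print(to_decimal(value, 61))
```

An in-place variant reuses the storage of `n` for `high` and `low` at each level, which saves memory but is exactly what makes rolling back to an earlier state hard.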

The difficulty of solving this combined with a lot of tech-debt in the implementation meant that I kept procrastinating it for years. And it would require rewriting the algorithm with a completely new approach. As the years went by, the old radix conversion became the last thing that was still using double-precision float-indexing. Thus it got to the point where the old radix conversion was holding back two things: checkpointing and y-cruncher's size limit.

So for v0.8.4, I finally bit the bullet and did that much needed redesign+rewrite of the entire swap mode radix conversion.

While the new radix conversion supports checkpointing, it still has a higher-than-normal chance of a failure corrupting a checkpoint due to the close spatial proximity of data writes to checkpointed data. So in preparation for this, support for multiple levels of checkpointing was added in the previous version (v0.8.3) that will give you multiple points to roll back to in the event of an interruption. For the radix conversion, it means having the option to roll back to *before* the conversion should the mid-conversion checkpoint become corrupted.
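Conceptually, rolling back with multiple checkpoint levels amounts to picking the newest checkpoint that is still intact. The sketch below is purely hypothetical: the `Checkpoint` class, the labels, and the use of a SHA-256 digest to detect corruption are all assumptions made for illustration and say nothing about how y-cruncher stores or validates its checkpoints.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Checkpoint:
    label: str    # e.g. "before-conversion" or "mid-conversion"
    data: bytes   # saved state
    digest: str   # SHA-256 of `data`, recorded when the checkpoint was written


def make_checkpoint(label, data):
    return Checkpoint(label, data, hashlib.sha256(data).hexdigest())


def is_intact(cp):
    """A checkpoint is usable only if its data still matches its digest."""
    return hashlib.sha256(cp.data).hexdigest() == cp.digest


def pick_rollback_point(checkpoints):
    """Return the newest intact checkpoint (list is ordered oldest to newest)."""
    for cp in reversed(checkpoints):
        if is_intact(cp):
            return cp
    return None


before = make_checkpoint("before-conversion", b"state A")
mid = make_checkpoint("mid-conversion", b"state B")
mid.data = b"garbage"  # simulate a failure corrupting the newest checkpoint
print(pick_rollback_point([before, mid]).label)
```

With only a single level, the corrupted mid-conversion checkpoint in this example would leave nothing to resume from.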

And lastly, the Amdahl's Law and AMD superalignment issue has been fixed. It seems that the problem was worse than I had initially thought. So bad that even Intel servers are sometimes outperforming the AMD ones for large swap mode computations. So this will be a very much welcomed ~~improvement~~ *bugfix* for AMD servers.

----

And that should be it for this release. As with all new releases that touch critical infrastructure, I strongly recommend testing at scale before attempting any long running computation (like a Pi record attempt). My own testing capability is of course limited to the (rather modest) hardware that I personally own. For this release, that piece of "critical infrastructure" is of course the radix conversion.

**Version 0.8.3 Patched:** (February 13, 2024) - permalink

A new patch for v0.8.3 has been released that fixes a serious bug in the N63 large multiply algorithm. The bug only affects specific versions and binaries.

The affected binaries are:

- v0.8.2: 18-CNL ~ Shinoa
- v0.8.2: 22-ZN4 ~ Kizuna
- v0.8.3: 18-CNL ~ Shinoa
- v0.8.3: 22-ZN4 ~ Kizuna

Both Windows and Linux are affected. The bug can only occur on computations above 29 trillion digits, with the likelihood increasing for larger sizes. While this bug affects very few people, it is severe for those it does affect, as it can block record attempts.

**FLINT: The Rising Star of Bignum Libraries:** (February 9, 2024) - permalink

So apparently there's a new crown for the fastest open-sourced bignum library. It is FLINT (Fast Library for Number Theory).

Historically, the "main" open-sourced bignum library was GMP which was (and still is) used by *everything* from computer algebra systems, to Python, to even the GCC compiler. It (along with its fork, MPIR) was also the fastest bignum library out there.

In the past, I used to follow both GMP and MPIR. But development on both stagnated around 10 years ago, which caused me to tune out of the field as a whole. And because of that, I completely missed the rise of the FLINT library which seems to have taken the scene by storm.

In simple terms, FLINT is a modern bignum library that supports SIMD and parallel computing - the very two things that GMP has failed to embrace. With AVX2, it beats GMP by 3x in raw speed and blows it out once any sort of parallelism is enabled.

With GMP left in the dust, the next natural thing to compare it to is... y-cruncher. But to be clear, this is not a fair comparison. FLINT is a generic library that can do a zillion things that y-cruncher cannot. And y-cruncher is specialized to do one thing - compute Pi. So they are different programs with different purposes that happen to overlap in some functionality. The fact that we're even here shows how far FLINT has come.

So we'll be comparing the following programs:

- Parallel GMP Chudnovsky: This is a 3rd party implementation of Pi using GMP that supports parallelism as well as an optimization known as GCD factorization.
- FLINT 3.0.1: Using their example program for Pi and other constants. Compiled with their new vectorized FFT using AVX512.
- y-cruncher v0.8.3: The "22-ZN4 ~ Kizuna" binary for Zen4 processors which uses AVX512.

All 3 programs use the Chudnovsky formula. And all programs include the time needed to convert the digits to decimal representation, but not the time to output or write the digits to disk.
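For reference, here is a compact Python sketch of the Chudnovsky formula evaluated with binary splitting, the general approach all three programs share. It is the plain textbook version using Python's built-in integers: no GCD factorization, no parallelism, and none of the optimizations that separate the programs being benchmarked. The function names are my own.

```python
from math import isqrt


def binary_split(a, b):
    """Return (P, Q, R) for Chudnovsky terms a .. b-1 (requires 1 <= a < b)."""
    if b == a + 1:
        p = -(6 * a - 5) * (2 * a - 1) * (6 * a - 1)
        q = 10939058860032000 * a**3          # (640320^3 / 24) * a^3
        r = p * (545140134 * a + 13591409)
        return p, q, r
    m = (a + b) // 2
    p1, q1, r1 = binary_split(a, m)
    p2, q2, r2 = binary_split(m, b)
    return p1 * p2, q1 * q2, q2 * r1 + p1 * r2


def pi_fixed(digits, guard=10):
    """Return Pi * 10^digits truncated to an integer (guard digits absorb round-off)."""
    terms = digits // 14 + 2                   # about 14.18 digits per term
    _, q, r = binary_split(1, terms)
    scale = 10 ** (digits + guard)
    sqrt_10005 = isqrt(10005 * scale * scale)  # sqrt(10005) * scale
    return 426880 * sqrt_10005 * q // (13591409 * q + r) // 10**guard


print(pi_fixed(50))
```

Nearly all of the run-time at large sizes goes into the big multiplications inside `binary_split`, the final square root and division, and the conversion to decimal, which is why the underlying bignum library matters so much.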

The benchmark system is the top-of-the-line desktop Zen4:

- AMD Ryzen 9 7950X (underclocked by lowering temperature limit to 80C)
- 192 GB DDR5 (overclocked from 3600 MT/s to 5000 MT/s)
- Ubuntu 22.04 and Windows 10 (22H2)

All benchmark times are in seconds and are highlighted in green. Memory usage is also tracked and highlighted in blue.

Unless otherwise stated, all benchmarks were run in Ubuntu 22.04.

All computations were done entirely in memory. There is no usage of disk or swap space.

**Pi - 1 thread**:

Each cell is time (s) / memory usage.

| Digits | GMP (no GCD) | GMP (GCD) | FLINT 3.0.1 | y-cruncher v0.8.3 (Linux) | y-cruncher (Windows) |
|---|---|---|---|---|---|
| 100,000,000 | 66.754 / 1.00 GB | 61.727 / 1.40 GB | 31.93 / 1.01 GB | 11.477 / 452 MB | 10.879 / 452 MB |
| 250,000,000 | 200.032 / 2.30 GB | 183.011 / 3.27 GB | 87.299 / 2.68 GB | 33.958 / 1.09 GB | 31.462 / 1.09 GB |
| 1,000,000,000 | 1,033.029 / 9.36 GB | 907.280 / 12.2 GB | 425.515 / 11.7 GB | 167.279 / 4.34 GB | 157.342 / 4.34 GB |
| 5,000,000,000 | 6,941.757 / 49.6 GB | 5,809.793 / 62.7 GB | 2,668.45 / 54.8 GB | 1,076.713 / 22.3 GB | 990.442 / 22.3 GB |
| 10,000,000,000 | 15,442.140 / 99.3 GB | 12,920.34 / 120 GB | 5,798.24 / 109 GB | 2,369.535 / 44.6 GB | 2,158.399 / 44.6 GB |
| 25,000,000,000 | Out of Memory | Out of Memory | Out of Memory | 6,568.023 / 111 GB | 6,107.845 / 111 GB |

The first thing we see here is that the GCD optimization gives about 10-20% speedup, but at the cost of 20% higher memory usage. This is why the program has a toggle to enable/disable GCD.

FLINT beats GMP by 2-3x. These results are consistent with their own reported benchmarks, confirming that I've at least correctly compiled it. And as they mention, FLINT does not do the GCD optimization. A 2-3x gain over the existing state-of-the-art is huge.

With this beatdown of GMP, it brings FLINT within a factor of 3 of y-cruncher. My benchmarks here show a slightly wider gap between FLINT and y-cruncher than what they have. But it is a different machine and I'm not sure if FLINT actually uses AVX512 despite having a flag for it (--enable-avx512). Regardless, AVX512 doesn't bring huge gains over AVX2 on Zen4 for pure floating-point workloads such as the type of algorithm that they are using.

For a generic library, this is very impressive as it probably lacks many of the specialized optimizations that y-cruncher does. (though worth noting that y-cruncher also does not do the GCD optimization)

Moving over to memory usage, both GMP and FLINT use more than twice the memory that y-cruncher does. This is reasonable since y-cruncher does make a substantial effort to reduce its memory usage. (more on this later)

So FLINT is looking very good so far. Now let's enable some parallelism...

**Pi - 32 threads**:

Each cell is time (s) / memory usage.

| Digits | GMP (no GCD) | GMP (GCD) | FLINT 3.0.1 | y-cruncher v0.8.3 (Linux) | y-cruncher (Windows) |
|---|---|---|---|---|---|
| 100,000,000 | 23.773 / 1.25 GB | 17.862 / 1.80 GB | 6.386 / 4.52 GB | 1.559 / 507 MB | 1.455 / 507 MB |
| 250,000,000 | 65.799 / 3.32 GB | 50.188 / 4.29 GB | 19.438 / 9.80 GB | 4.234 / 1.15 GB | 4.079 / 1.15 GB |
| 1,000,000,000 | 317.333 / 12.4 GB | 224.57 / 16.6 GB | 95.201 / 34.3 GB | 19.428 / 4.40 GB | 18.793 / 4.40 GB |
| 5,000,000,000 | 2,004.362 / 51.7 GB | 1,312.231 / 74.6 GB | 550.325 / 138 GB | 113.422 / 22.3 GB | 111.467 / 22.3 GB |
| 10,000,000,000 | 4,275.017 / 99.2 GB | 2,872.915 / 141 GB | Out of Memory | 252.495 / 44.7 GB | 242.246 / 44.7 GB |
| 25,000,000,000 | Out of Memory | Out of Memory | Out of Memory | 714.804 / 111 GB | 683.368 / 111 GB |

As this is a 16-core machine with hyperthreading, the maximum speedup expected would be on the order of 16x. But because of power throttling, turbo limits, and other resource bottlenecks, actual parallel speedups will fall well short of that.

The first thing we see here is that y-cruncher scales better than both GMP and FLINT as the performance gap grows to 12x and 5x respectively. From watching the Ubuntu System Monitor, it is obvious that Amdahl's Law is having an effect. Both GMP and FLINT spend significant amounts of time running code that is either not parallelized or is under-parallelized. With such a big loss to Amdahl's Law, it becomes difficult to gauge if any other bottlenecks come into play. Memory bandwidth is the one I was most interested in because that is y-cruncher's Achilles' heel. Maybe I'll revisit this in the future if/when FLINT is able to approach 100% CPU utilization.
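To make the scaling concrete, the snippet below takes the 1,000,000,000-digit rows from the two Pi tables above and computes each program's parallel speedup, plus the parallel fraction that Amdahl's Law would imply on 16 cores. Treat the implied fraction as a crude lower bound: it blames all of the lost speedup on serial code, whereas throttling and other bottlenecks take their share too.

```python
# (1-thread seconds, 32-thread seconds) for 1,000,000,000 digits of Pi,
# copied from the benchmark tables.
TIMES = {
    "GMP (GCD)": (907.280, 224.57),
    "FLINT 3.0.1": (425.515, 95.201),
    "y-cruncher v0.8.3 (Linux)": (167.279, 19.428),
}
CORES = 16


def speedup(t1, tn):
    return t1 / tn


def amdahl_parallel_fraction(s, n):
    """Solve Amdahl's Law S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    return (1 - 1 / s) / (1 - 1 / n)


for name, (t1, t32) in TIMES.items():
    s = speedup(t1, t32)
    p = amdahl_parallel_fraction(s, CORES)
    print(f"{name}: speedup {s:.2f}x, implied parallel fraction {p:.3f}")
```

On these numbers GMP and FLINT speed up by roughly 4x to 4.5x while y-cruncher reaches about 8.6x, which is where the wider 12x and 5x gaps come from.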

But the elephant in the room is FLINT's memory usage - which skyrockets to more than 2x of what GMP needs and 5x of what y-cruncher needs. So while both GMP and y-cruncher manage to avoid blowing up memory when parallelizing, FLINT does not.

I didn't try very hard to see what FLINT is doing wrong, but I suspect it may be parallelizing at too high a level. Parallelizing at the highest level of an algorithm tends to result in memory usage that grows linearly with the # of threads until the entire working set is in memory simultaneously.

Both the 3rd party GMP Pi implementation and y-cruncher serialize the upper levels of the computation specifically to avoid blowing up memory. But while this approach reduces memory usage, it is not a free optimization and can have negative performance impact. So it's a space-time trade-off.

y-cruncher, which is aimed at extremely large computations, tries to strike a balance between speed and memory efficiency. So it trades away some speed to use less memory since it allows bigger computations to be done. Some of the old classic Pi programs like PiFast also do this. PiFast even makes this tunable where you can select how much speed to trade away to reduce memory.

**Euler-Mascheroni Constant - 32 threads:**

The memory explosion when parallelizing tends to get worse for slower constants. So let's look at the Euler-Mascheroni Constant - which is much slower than Pi.

And sure enough, FLINT memory usage spikes to nearly 20x of y-cruncher.

Each cell is time (s) / memory usage.

| Digits | FLINT 3.0.1 | y-cruncher v0.8.3 (Linux) | y-cruncher (Windows) |
|---|---|---|---|
| 10,000,000 | 9.037 / 2.35 GB | 3.274 / 243 MB | 2.427 / 243 MB |
| 100,000,000 | 141.718 / 14.3 GB | 36.999 / 795 MB | 34.928 / 795 MB |
| 1,000,000,000 | 2,038.47 / 138 GB | 575.467 / 7.07 GB | 560.219 / 7.07 GB |
| 10,000,000,000 | Out of Memory | 7,139.21 / 68.6 GB | 7,005.28 / 68.6 GB |

While this looks pretty bad, it is likely far beyond what FLINT is designed for and thus not on the developers' radar. The Euler-Mascheroni Constant is also far less common than Pi and thus gets much less attention (both in usage and developer time).

Overall, FLINT has come a long way since the days of GMP's dominance. Pushing it to its limit does reveal some glaring inefficiencies which (if I've correctly assessed them) are probably easy to fix. So while I don't expect FLINT to beat y-cruncher any time soon, it's likely a few easy optimizations away from coming a lot closer.