Large Multiplication

By Alexander Yee

 

(Last updated: October 15, 2023)

 


 

Large integer multiplication is perhaps the most important part of any bignum or Pi program that intends to reach millions of digits. Nearly every non-trivial operation (division, square root, etc...) reduces to large integer multiplication.

 

Overview

 

Because of its importance, large multiplication is the most heavily optimized component in y-cruncher. Its interface along with all the supporting algorithms accounts for approximately 2/3 of the roughly 720,000 lines of code in y-cruncher.

 

As of v0.8.2, y-cruncher has the following set of algorithms and implementations:

 

Algorithms with a "Removed in ..." note are no longer used by the latest version of y-cruncher; the rest are still in use.

Algorithm | Code | First Appearance | Parallelized | Swap Mode | MRFM | Notes
Basecase Multiplication | BKT | v0.1.0 January 2009 | No | No | No |
Karatsuba Multiplication | | | No | No | No |
3-Way Toom-Cook | | | No | No | No | Removed in v0.6.1.
Floating-Point FFT | FFT | | No | No | No |
32-bit Small Primes NTT | N32 | v0.6.9 December 2015 | Yes | No | No | Removed in v0.8.1.
64-bit Small Primes NTT | N64 | v0.6.8 March 2015 | Yes | Yes | Yes | Version 1 - Removed in v0.8.2.
64-bit Small Primes NTT | N63 | v0.8.2 September 2023 | Yes | Yes | Yes | Version 2
Hybrid NTT | HNT | v0.1.0 January 2009 | Yes | Yes | No | Removed in v0.8.1.
Vector-Scalable Transform (VST Algorithm) | VST | v0.6.1 February 2013 | Yes | Yes | Yes | Versions 1 and 2
Vector-Scalable Transform (VST Algorithm) | VT3 | v0.8.1 July 2023 | Yes | Yes | Yes | Version 3
"Cannon Lake 2017" | C17 | v0.7.1 May 2016 | Yes | Yes | Yes | Removed in v0.8.2.

This set of algorithms has been accumulating since the start of the y-cruncher project. No processor architecture uses every single algorithm in this table. But every algorithm except for Toom-3 was useful for at least one processor line at some point in the history of the y-cruncher project.

 

The 3-letter codes are the ones that appear in the stress-tester and the swap mode status printing.

 

 

Organization:

 

Internally, these algorithms are managed through a virtual method interface. This interface defines all the functions that a "multiplication algorithm" needs to have. Then each algorithm implements it. Instances of these objects are thrown into a global algorithm table with an activation threshold.

 

When a computation requests a large multiply, the framework walks the global table to select an algorithm object, then uses that object to perform the multiplication. The overhead of this entire process is made negligible by short-circuiting small multiplications directly to the basecase algorithm.
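
The dispatch described above can be sketched roughly as follows. This is only an illustration: the names (BigMultiplyAlgorithm, TableEntry, select_algorithm) and the exact shape of the interface are assumptions, not y-cruncher's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class BigMultiplyAlgorithm {
public:
    virtual ~BigMultiplyAlgorithm() = default;
    // Multiply two little-endian 64-bit limb arrays: C = A * B.
    virtual void multiply(uint64_t* C,
                          const uint64_t* A, size_t La,
                          const uint64_t* B, size_t Lb) const = 0;
};

struct TableEntry {
    size_t threshold;                        // smallest operand size (in limbs) where this algorithm activates
    const BigMultiplyAlgorithm* algorithm;
};

// Populated at startup on a per-processor (and per-computing-mode) basis.
std::vector<TableEntry> g_algorithm_table;

// Walk the table and pick the algorithm with the largest activation threshold
// not exceeding the operand size. Entries are assumed sorted by threshold.
const BigMultiplyAlgorithm* select_algorithm(size_t size_in_limbs) {
    const BigMultiplyAlgorithm* best = nullptr;
    for (const TableEntry& entry : g_algorithm_table) {
        if (entry.threshold <= size_in_limbs)
            best = entry.algorithm;
    }
    return best;
}
```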

 

This global algorithm table is customized on a per-processor basis. And there are separate tables for each of the computing modes since not all of the algorithms have implementations for the out-of-core computation modes (Swap Mode and MRFM).

 

 

Size Records:

 

Historically, the largest multiplications that y-cruncher has ever done have been part of the Pi world record computations. They happen during the "Final Multiply" step of the respective Pi computation. This "Final Multiply" is the only full-sized multiplication in the entire Pi computation.

Size (N x N decimal digits) | Type | Date Finished | Who | y-cruncher Version | Algorithm | Time
100 trillion | Integer Multiply | June 2022 | Emma Haruka Iwao | v0.7.8 | 64-bit Small Primes NTT | 46 hours
62.8 trillion | Integer Multiply | August 2021 | UAS Grisons | v0.7.8 | 64-bit Small Primes NTT | 39 hours
50 trillion | Integer Square | March 31, 2019 | Timothy Mullican | v0.7.7 | 64-bit Small Primes NTT | 55 hours
31.4 trillion | Integer Multiply | January 10, 2019 | Emma Haruka Iwao | v0.7.6 | VST Algorithm | 48 hours
22.4 trillion | Integer Multiply | October 30, 2016 | Peter Trueb | v0.7.1 | VST Algorithm | 72 hours
13.3 trillion | Integer Multiply | September 12, 2014 | "houkouonchi" | v0.6.3 | VST Algorithm | 89 hours
12.1 trillion | Integer Multiply | December 14, 2013 | Shigeru Kondo | v0.6.2 | VST Algorithm | 32 hours
10 trillion | Integer Multiply | September ??, 2011 | Shigeru Kondo | v0.5.5 | Hybrid NTT | 86 hours
5 trillion | Integer Multiply | July 22, 2010 | Shigeru Kondo | v0.5.4 | Hybrid NTT | 42 hours

Going further back, there were many localized tests of large multiplications including the squaring of a 5.2 trillion digit integer in about 40 hours. But none of these tests were ever properly recorded - let alone in detail. The only certain detail is that they were all done using the Hybrid NTT since that was the only large-sized multiplication algorithm that y-cruncher had prior to 2012.

 

 

Size Limits:

 

The largest multiplication that y-cruncher can theoretically do is (6.7 * 10^17) x (6.7 * 10^17) decimal digits. This is achieved with the 64-bit NTT using 5 primes and a transform length of 7 * 2^52 which provides a convolution length of 4.03 * 10^18 bits. However, such an operation will require over 2 exabytes of memory which is near the limit of 64-bit addressing and is therefore beyond the reach of current hardware as of 2015.

 

 

 

 

Multiplication Algorithms

 

Basic Algorithms:

 

The Basecase and Karatsuba algorithms handle the smallest products. Their implementations in y-cruncher are fairly standard, uninteresting, and inefficient. In the past, you needed to use assembly to make these algorithms efficient. But things are better now with compiler intrinsic support for both 64-bit multiply and add-with-carry.
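
For illustration, a minimal schoolbook (basecase) multiply over 64-bit limbs might look like the sketch below. This uses unsigned __int128 for the 64x64->128 product rather than the _umul128 / _addcarry_u64 intrinsics mentioned above, and it is not y-cruncher's actual code.

```cpp
#include <cstddef>
#include <cstdint>

// C (size La+Lb) = A (size La) * B (size Lb), all little-endian limb arrays.
void basecase_multiply(uint64_t* C,
                       const uint64_t* A, size_t La,
                       const uint64_t* B, size_t Lb) {
    for (size_t i = 0; i < La + Lb; i++) C[i] = 0;
    for (size_t i = 0; i < La; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < Lb; j++) {
            // 64x64 -> 128-bit product, plus the limb already at C[i+j], plus the running carry.
            unsigned __int128 t = (unsigned __int128)A[i] * B[j] + C[i + j] + carry;
            C[i + j] = (uint64_t)t;
            carry    = (uint64_t)(t >> 64);
        }
        C[i + Lb] += carry;   // carry-out of this row lands in a fresh limb
    }
}
```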

 

Currently, the basic multiplication algorithms are not performance bottlenecks. Most of the run-time is spent in the FFT-based algorithms for large products. For this reason, very little effort has been made to optimize these basic algorithms.

 

In the past, y-cruncher also had an implementation of 3-way Toom-Cook. But it was never optimal to use on any processor and was removed after v0.5.5. Likewise, no attempt was ever made to implement higher-order or unbalanced Toom-Cook algorithms. The Floating-Point FFT is so fast that it renders all of these more complicated methods completely obsolete. If current hardware trends continue, even Karatsuba multiplication will eventually stop being useful on future processors.

 

 

Floating-Point FFT:

 

The Floating-Point FFT handles medium-sized products, from a few hundred digits up to what fits in CPU cache.

 

y-cruncher's implementation is a complex-to-complex FFT using right-angle convolution. It uses radix-4 and split-radix butterfly reductions and supports transform lengths of 2^k, 3*2^k, and 5*2^k. The implementation is optimized for pure speed by means of SIMD and fused multiply-add (if available).
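
To make the overall structure concrete, here is a minimal sketch of FFT-based multiplication. It uses a plain recursive radix-2 transform, an ordinary cyclic convolution instead of the right-angle convolution described above, and a small base of 10^4 digits per point; all of these choices (and the names) are illustrative assumptions, not y-cruncher's implementation.

```cpp
#include <cmath>
#include <complex>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;
static const double PI = 3.14159265358979323846;

// In-place recursive radix-2 Cooley-Tukey FFT; 'invert' selects the inverse transform.
void fft(std::vector<cplx>& a, bool invert) {
    size_t n = a.size();
    if (n == 1) return;
    std::vector<cplx> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; i++) { even[i] = a[2 * i]; odd[i] = a[2 * i + 1]; }
    fft(even, invert);
    fft(odd, invert);
    double ang = 2 * PI / (double)n * (invert ? -1 : 1);
    for (size_t i = 0; i < n / 2; i++) {
        cplx w = std::polar(1.0, ang * (double)i);
        a[i]         = even[i] + w * odd[i];
        a[i + n / 2] = even[i] - w * odd[i];
        if (invert) { a[i] /= 2; a[i + n / 2] /= 2; }   // divide by n overall
    }
}

// Multiply two numbers stored as little-endian digit vectors in base 10^4.
std::vector<int64_t> fft_multiply(const std::vector<int64_t>& x, const std::vector<int64_t>& y) {
    size_t n = 1;
    while (n < x.size() + y.size()) n <<= 1;
    std::vector<cplx> fx(x.begin(), x.end()), fy(y.begin(), y.end());
    fx.resize(n);
    fy.resize(n);
    fft(fx, false);                                   // F(A)
    fft(fy, false);                                   // F(B)
    for (size_t i = 0; i < n; i++) fx[i] *= fy[i];    // pointwise product
    fft(fx, true);                                    // inverse transform
    std::vector<int64_t> result(n);
    int64_t carry = 0;
    for (size_t i = 0; i < n; i++) {                  // round to integers and release carries
        int64_t v = (int64_t)std::llround(fx[i].real()) + carry;
        result[i] = v % 10000;
        carry     = v / 10000;
    }
    return result;                                    // little-endian base-10^4 digits of x*y
}
```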

 

As of 2015, the Floating-Point FFT remains the fastest multiplication algorithm in pure computational speed. It blows away every other algorithm, Karatsuba, high-order Toom-Cook, NTT... you name it. Nothing else even comes close. And the FFT will continue to get faster as it can take advantage of the ever widening SIMD.

 

As awesome as the FFT is, it has two major drawbacks: Memory consumption and round-off error. For large products, the FFT consumes too much memory and memory bandwidth. The memory cost is actually super-linear since the number of bits that can be placed in each point decreases for larger FFTs. And even when memory quantity isn't a problem, the pressure on memory bandwidth will kill off any parallel scalability.

 

Round-off error is the other issue and is the reason why most standard libraries avoid the algorithm completely. The FFT uses floating-point arithmetic where the results need to be correctly rounded to the integer. However, it is difficult to prove that the results will always be correctly rounded.

 

 

Number-Theoretic Transform:

 

The most common alternative to the Floating-Point FFT is the Number-Theoretic Transform (NTT). NTTs are about an order of magnitude slower than FFTs, but they require only a fraction of the memory cost. So while FFTs are bandwidth constrained, NTTs tend to be compute-bound - which makes them good candidates for parallelism and out-of-core computation.

 

The classic approach to the NTT is to perform several NTTs over different prime numbers. The results are then recombined using the Chinese Remainder Theorem.

y-cruncher uses the classic approach with Victor Shoup's butterfly algorithm which requires no divisions.
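
Shoup's trick pairs each twiddle factor w with a precomputed quotient floor(w * 2^64 / p), after which w*t mod p needs only multiplies, a subtraction, and one conditional correction. The sketch below illustrates the idea under the assumption of a modulus below 2^63; the names and the exact butterfly form are illustrative, not y-cruncher's kernel code.

```cpp
#include <cstdint>

struct Twiddle {
    uint64_t w;        // twiddle factor, 0 <= w < p
    uint64_t w_shoup;  // floor(w * 2^64 / p), precomputed once per twiddle
};

Twiddle make_twiddle(uint64_t w, uint64_t p) {
    return { w, (uint64_t)(((unsigned __int128)w << 64) / p) };
}

// Returns (w * t) mod p with no runtime division.
uint64_t mulmod_shoup(uint64_t t, const Twiddle& tw, uint64_t p) {
    uint64_t q = (uint64_t)(((unsigned __int128)tw.w_shoup * t) >> 64);  // ~ floor(w*t / p)
    uint64_t r = tw.w * t - q * p;   // exact value is in [0, 2p), so 64-bit wraparound is harmless
    return r >= p ? r - p : r;
}

// One decimation-in-time butterfly: (x, y) -> (x + w*y, x - w*y) mod p.
void butterfly(uint64_t& x, uint64_t& y, const Twiddle& tw, uint64_t p) {
    uint64_t wy = mulmod_shoup(y, tw, p);
    uint64_t a = x + wy;  if (a >= p) a -= p;
    uint64_t b = x - wy + (x < wy ? p : 0);
    x = a;
    y = b;
}
```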

 

y-cruncher has two implementations of the NTT:

The 32-bit NTT is faster, but it has a size limit of about a billion bits. The 64-bit NTT is slower, but it has virtually no size limit.

 

The two implementations are identical in virtually every aspect other than the word-size. Both can use anywhere from 3 to 9 primes. And both support transform sizes of 2^k, 3*2^k, 5*2^k, and 7*2^k. This similarity is by design since they share much of the same code by means of template metaprogramming.
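
For a concrete picture of the CRT recombination mentioned above, here is a sketch for just two primes; y-cruncher's actual construction over 3 to 9 primes is more involved, and the helper names are illustrative assumptions. It reconstructs x mod (p1*p2) from r1 = x mod p1 and r2 = x mod p2, assuming p1 and p2 are distinct primes.

```cpp
#include <cstdint>

typedef unsigned __int128 u128;

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((u128)a * b % m);
}

uint64_t powmod(uint64_t a, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    for (; e; e >>= 1, a = mulmod(a, a, m))
        if (e & 1) r = mulmod(r, a, m);
    return r;
}

// x = r1 + p1 * ((r2 - r1) * p1^-1 mod p2) satisfies x = r1 (mod p1) and x = r2 (mod p2).
u128 crt2(uint64_t r1, uint64_t p1, uint64_t r2, uint64_t p2) {
    uint64_t p1_inv = powmod(p1 % p2, p2 - 2, p2);        // p1^-1 mod p2 via Fermat (p2 prime)
    uint64_t diff   = (r2 + p2 - r1 % p2) % p2;           // (r2 - r1) mod p2
    uint64_t k      = mulmod(diff, p1_inv, p2);
    return (u128)p1 * k + r1;                             // result in [0, p1*p2)
}
```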

 

An early prototype of the 64-bit NTT can be found on my GitHub.

 

y-cruncher's NTTs are not optimized to be as fast as possible. Some sacrifices are made to reduce complexity and improve memory efficiency. For example, the NTT can be made faster by using 30 or 62-bit primes. But in the context of y-cruncher, this approach leads to unfortunate side-effects which are difficult to deal with.

 

 

Other Algorithms:

 

The remaining three algorithms are unique to the y-cruncher project:

The HNT and VST were developed during my college days. The Hybrid NTT came in 2008 in an attempt to find an algorithm that would serve as a middle-ground in the horrific space-time tradeoff between the FFTs and the NTTs. The success of the Hybrid NTT is what ultimately led to y-cruncher itself.

 

The VST algorithm started off as a stupid experiment in 2011 to build an NTT using only floating-point to utilize Sandy Bridge's AVX instructions. The idea was based on a prediction that AVX was just the start of a longer trend of increasingly powerful SIMD hardware with an emphasis on floating-point. Once it became clear the algorithm would work, I gave it its current name - the "Vector-Scalable Transform".

 

Initially the VST was not able to beat the Hybrid NTT and was therefore only useful for disk-swapping modes. But as was correctly predicted, SIMD became increasingly powerful with AVX2, FMA, AVX512, etc... So the VST algorithm eventually overtook the Hybrid NTT and became the main algorithm for in-memory computations.

 

The C17 algorithm is a code-word and abbreviation for "Cannon Lake 2017". It is a super-specialized algorithm I implemented back in 2016 that is designed around the AVX512-IFMA instructions which would first appear in Cannon Lake processors. The 2017 in the name is the year I predicted that Cannon Lake would launch, though it ended up slipping into 2018 due to Intel's 10nm problems.

 

 

Floating-Point Algorithms

 

The use of floating-point FFTs in bignum arithmetic has always been a source of contention.

 

Pros:

Cons:

The last point is important. The very nature of floating-point algorithms is that they have round-off error. Using them to perform an exact operation requires that round-off error be kept at an acceptable level. In practice, the radix (# of bits per point) is chosen to be as large as possible based on what is empirically found to be the limit before round-off error becomes a problem. But the provable limits on round-off are much more pessimistic.
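
One common way to probe such an empirical limit (a sketch of the general technique, not a statement about y-cruncher's internals) is to measure how far the inverse-transform output lands from the nearest integers on stress inputs. If the worst-case distance creeps toward 0.5, the chosen bits-per-point is too aggressive for that transform length.

```cpp
#include <cmath>
#include <cstddef>

// Worst distance of any coefficient from the nearest integer, measured after
// the inverse transform but before rounding.
double max_roundoff_error(const double* coefficients, size_t n) {
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double err = std::fabs(coefficients[i] - std::nearbyint(coefficients[i]));
        if (err > worst) worst = err;
    }
    return worst;   // e.g., reject the configuration if this exceeds ~0.25 on worst-case inputs
}
```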

 

Implementations that use provable bounds instead of empirical bounds require approximately double the transform length. (2x computation, 2x memory)

This is a lot of performance to be throwing away. Therefore, many programs (PiFast, TachusPi, y-cruncher) decide to "take the risk" and use the empirical bounds.

 

That said, there are varying levels for which corners can be cut:

  1. Provably correct.
  2. Empirically correct for the worst case.
  3. Empirically correct for the average case.

y-cruncher is in the middle category. All the settings are extensively tested to work on carefully chosen worst-case inputs. Then an extra safety margin is added.

In other words, there are no known inputs that would fail. But nevertheless, it is still not provably correct.

 

On the other hand, Prime95 from the GIMPS project would be in the last category. It relies on balanced representation and destructive cancellation to reduce the sizes of the coefficients - thereby allowing more bits to be crammed into each point. While this works for truly "random" inputs such as those in Mersenne prime testing, it is also easy to artificially construct inputs that would break the algorithm.

 

Unfortunately, the numbers encountered in y-cruncher aren't always "completely random". Some operations such as Newton's Method can produce values with many consecutive 1 bits - which also happens to be a worst-case input for some algorithms.

 

 

 

 

Theoretical vs. Practical: Mathematics vs. Engineering

 

From a mathematical perspective, the use of anything other than the provable bounds is absolutely prohibited since the algorithm could be incorrect. But from an engineering perspective, other factors come into play.

 

The goal is to multiply large numbers correctly. But there are a number of things that can affect the correctness of a large multiply:

Even with a correct algorithm, it is not possible to multiply large numbers with 100% reliability. Upon analyzing these failure cases, it becomes obvious that an imperfect algorithm can be used without affecting reliability if the other sources of error are much larger.

Even if you assume bug-free software (highly unrealistic), you can't do better than 1 error in 10^17 due to the hardware. Using empirical bounds for FFTs adds a probability of error that is basically negligible compared to the 1 in 10^17 rate of hardware errors.

 

Regardless of what the source of error is (software/hardware/algorithmic), redundancy checks can be used at minimal cost to catch errors.

Applying a modular identity over a 64-bit prime puts the probability of failing to detect an error at about 1 in 2^64. Ultimately, the scenario we want to avoid is an undetected error. Assuming bug-free code, the probability of an undetected error is 1 in 2^64 * 10^17 ~ 10^36. That's not likely to happen. If anything fails, it will be something stupid like a bug in the error-detection logic...

 

Needless to say, most bignum programs that use the FFT take the engineering approach - with or without the error-detection.

y-cruncher uses the error-detection with modulo 2^61 - 1 (a Mersenne prime) for reasons of performance.
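
As an illustration of this kind of check (a sketch assuming little-endian 64-bit limbs; not y-cruncher's actual code), verifying A*B = C modulo the Mersenne prime 2^61 - 1 needs only shifts, masks, adds, and one 64x64 multiply per limb:

```cpp
#include <cstddef>
#include <cstdint>

static const uint64_t P61 = (1ULL << 61) - 1;

// Fold a 128-bit value into [0, 2^61 - 1), using 2^61 = 1 (mod p).
uint64_t reduce_m61(unsigned __int128 x) {
    uint64_t r = (uint64_t)(x & P61) + (uint64_t)((x >> 61) & P61) + (uint64_t)(x >> 122);
    r = (r & P61) + (r >> 61);
    return r >= P61 ? r - P61 : r;
}

// Residue of a little-endian 64-bit limb array: sum of limb[i] * 2^(64*i) mod p (Horner form).
uint64_t residue_m61(const uint64_t* limbs, size_t n) {
    uint64_t r = 0;
    for (size_t i = n; i-- > 0;)
        r = reduce_m61(((unsigned __int128)r << 64) | limbs[i]);   // r = r * 2^64 + limbs[i]
    return r;
}

// Returns false if the product is definitely wrong; an erroneous product slips
// through only if the error is a multiple of p (roughly a 1 in 2^61 chance).
bool check_product(const uint64_t* A, size_t La,
                   const uint64_t* B, size_t Lb,
                   const uint64_t* C, size_t Lc) {
    uint64_t ra = residue_m61(A, La);
    uint64_t rb = residue_m61(B, Lb);
    uint64_t rc = residue_m61(C, Lc);
    return reduce_m61((unsigned __int128)ra * rb) == rc;
}
```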

 

 

Other Algorithms:

 

The floating-point FFT isn't the only algorithm that suffers from this sort of round-off error. All of y-cruncher's proprietary algorithms (HNT, VST, and C17) also use floating-point in ways that are difficult to prove correct for all inputs. So in that sense, all of these algorithms are useless for libraries that require provable correctness.

 

For that matter, even the NTTs (which are integer algorithms) have some shady floating-point usage in their CRT construction steps. But these uses are very localized. So while they haven't been rigorously proven to be correct, they have been carefully analyzed and deemed suitable for their purposes.

 

Large integer multiplication isn't the only place in y-cruncher that uses floating-point for otherwise integer tasks. This sort of stuff is littered all over the entire program. The most brutal example of this is in the Digit Viewer's binary-to-decimal conversion routine. This one has been proven correct for all inputs under the assumption of strict-IEEE floating-point behavior.

 

 

 

Out-of-Core Algorithms

 

Out-of-core algorithms (aka external memory algorithms) are algorithms that are meant to process data that is too large to fit in memory. In the current era, large number computations of trillions of digits are typically multiple orders of magnitude too large to fit in main memory. Therefore, out-of-core algorithms are critical to performing computations of these sizes.

 

Currently, the only form of out-of-core computation that y-cruncher supports is Swap Mode, which uses disk storage as the global memory space and ram as a cache. This Swap Mode is what has been used for nearly all of the size records set with y-cruncher.

 

Internally, y-cruncher has a newer out-of-core mode called "Multi-Region Far Memory", or MRFM. This is part of a larger initiative to attack the NUMA-scalability problem.

As of 2017, y-cruncher has production-ready implementations of MRFM large multiplication. But since the rest of the program isn't MRFM-ready, none of it will see the light of day any time soon - if ever. Therefore, the rest of this section will focus on just the Swap Mode even though the methods are largely the same with MRFM.

 

 

Swap Mode FFT:

 

y-cruncher's Swap Mode multiply is just the standard recursive FFT algorithm - but on disk instead of ram. All the data resides on disk. And ram is only used as a cache for the disk. Since computation can only happen in ram, any data that needs to be touched is explicitly transferred from disk to memory and back.

 

Since disk is very slow, the key to performance is to reduce the amount of data that needs to be transferred to and from disk. Broadly speaking, these are some of the optimizations that are done to reduce disk access:

  1. Large Radix Reductions: The larger the reduction size, the fewer read/write passes over disk are needed to perform the FFT. Typically, the first radix reduction is chosen large enough that the subsequent sub-transforms fit into memory. This allows the entire FFT to be done in only two passes over the dataset. For FFTs that are orders of magnitude larger than memory, this first reduction can have a radix of 1024 or higher.

  2. Aggressive Pass Merging: Steps of the FFT which are normally done as separate passes are merged together to reduce the number of times the same data needs to be read-from/written-to disk. In practice, this can get messy really quickly as it requires breaking through software abstraction layers.

  3. Overlap Computation with Disk I/O: Perform computation and disk I/O in parallel since they use different resources.

When all optimizations are put together, it often becomes possible to multiply a number by itself with only 2 read/write passes over the FFT domain data along with 1 read of the input and 1 write of the output.
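
Optimization 3 above is commonly implemented with double buffering: while one chunk is being processed, the next one is already being read. The sketch below shows the generic pattern only; the names, chunk handling, and std::async-based structure are illustrative assumptions, not y-cruncher's I/O layer.

```cpp
#include <cstdio>
#include <future>
#include <vector>

using Chunk = std::vector<double>;

// Read up to 'count' elements; an empty chunk signals end-of-file.
Chunk read_chunk(std::FILE* f, size_t count) {
    Chunk c(count);
    size_t got = std::fread(c.data(), sizeof(double), count, f);
    c.resize(got);
    return c;
}

void process(Chunk& c) {
    for (double& x : c) x *= 2.0;   // stand-in for the real FFT pass
}

void process_file(std::FILE* f, size_t chunk_elems) {
    // Prefetch the first chunk, then always keep the next read in flight
    // while the current chunk is being processed.
    std::future<Chunk> pending = std::async(std::launch::async, read_chunk, f, chunk_elems);
    while (true) {
        Chunk current = pending.get();
        if (current.empty()) break;
        pending = std::async(std::launch::async, read_chunk, f, chunk_elems);  // next read overlaps...
        process(current);                                                      // ...with this computation
        // (a real implementation would also write 'current' back asynchronously)
    }
}
```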

 

Of course things aren't quite that simple. The number of passes over disk isn't the only thing that matters. The number of disk seeks is also important. For FFTs that are orders of magnitude larger than memory, the reduction size needed to do such a 2-pass FFT may result in a prohibitively large number of disk seeks. So there are trade-offs that need to be considered. The "Bytes/Seek" parameter in Swap Mode is the knob that tells y-cruncher how to handle this trade-off.

 

 

More recently, the CPU/memory-access performance gap has grown so large that some of these methods used by Swap Mode to minimize access to disk are now being used to minimize access to main memory.

 

 

Implementation:

 

Practically speaking, these optimized out-of-core FFT approaches are non-trivial to implement. While the total code size isn't as obnoxious as the SIMD intrinsic code that makes up the kernels of the individual multiplication algorithms, there are numerous corner cases (such as carry-out) that need to be properly dealt with. And with multiple algorithms each having multiple modes, this can get very unwieldy from a development and maintenance standpoint.

 

So rather than having a dozen different out-of-core implementations, there are two generic ones: One for Swap Mode and one for MRFM.

 

Each of these generic implementations is built entirely on top of a set of primitives. So any multiplication algorithm that provides these primitives will get a fully working out-of-core implementation.

 

 

 

Transform-Only Interface

 

Reuse of redundant FFT transforms is an optimization for operations that multiply by the same number more than once.

Recall that an FFT-based multiply consists of transforming each of the two operands, pointwise multiplying, and inverse transforming the result.

 

A * B:

  1. F(A)
  2. F(B)
  3. I(F(A) * F(B))

There are 3 transforms. (2 forward, 1 inverse) Now if we wanted to do two multiplies...

 

A * B, A * C:

  1. F(A)
  2. F(B)
  3. I(F(A) * F(B))
  4. F(C)
  5. I(F(A) * F(C))

That's 5 transforms. (3 forward, 2 inverse) We have saved 1 transform. Assuming an algorithm with trivial pointwise convolution, this is a 16.6% savings.

This particular optimization is applicable to most Binary Splitting recursions as well as Newton's Method iterations and radix conversions.

 

In practice, the savings is slightly less since inverse transforms tend to be more expensive than forward transforms due to various overheads such as carryout.

 

 

Implementation:

 

Reusing transforms may sound like a decent optimization, but it can be very difficult to implement.

These are pretty big barriers, so few programs actually do it. y-cruncher couldn't do it prior to v0.6.1.

 

Older versions of y-cruncher (v0.5.5 and earlier) encapsulated multiplication. Because of all the different multiplication algorithms, breaking through this encapsulation to reuse transforms was simply not feasible.

 

y-cruncher v0.6.1 saw a complete redesign and rewrite of the core algorithms and representation. This time, a transform-only interface was added. All multiplication algorithms were required to implement this interface. For algorithms that didn't use transforms (Basecase/Karatsuba), the interface was faked with some overhead.
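
As a rough sketch of what such an interface can look like (illustrative types and names, not y-cruncher's actual API), the caller transforms each operand once and then combines transformed operands freely. Computing A*B and A*C this way costs five transforms instead of six because F(A) is shared:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

using Limbs = std::vector<uint64_t>;

// Opaque handle to F(X); each algorithm provides its own derived representation.
struct TransformedOperand {
    virtual ~TransformedOperand() = default;
};

class MultiplyAlgorithm {
public:
    virtual ~MultiplyAlgorithm() = default;
    // F(X): forward-transform an operand into the algorithm's internal domain.
    virtual std::unique_ptr<TransformedOperand> forward(const Limbs& x) = 0;
    // I(F(A) * F(B)): pointwise-multiply two transformed operands and inverse-transform.
    virtual Limbs multiply_transformed(const TransformedOperand& fa,
                                       const TransformedOperand& fb) = 0;
};

// A*B and A*C with only 5 transforms: F(A) is computed once and reused.
void two_products(MultiplyAlgorithm& algo,
                  const Limbs& A, const Limbs& B, const Limbs& C,
                  Limbs& AB, Limbs& AC) {
    auto fa = algo.forward(A);                   // F(A), shared
    auto fb = algo.forward(B);                   // F(B)
    AB = algo.multiply_transformed(*fa, *fb);    // I(F(A) * F(B))
    auto fc = algo.forward(C);                   // F(C)
    AC = algo.multiply_transformed(*fa, *fc);    // I(F(A) * F(C))
}
```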

 

The transform-only interface allows high-level mathematical code to reuse redundant transforms without breaking any encapsulation.

Furthermore, it also allows other things that couldn't be done with the old interface: