y-cruncher Multi-Precision Library

A Multi-threaded Multi-Precision Arithmetic Library

(Powered by y-cruncher)

(Last updated: March 3, 2016)

 

Shortcuts:

YMP Library:

Large Number Objects:

Low-Level Programming:

Low-Level Programming

 

The standard functions for BigIntO and BigFloatO abstract away a lot of complexity. While they are easy to use, this abstraction comes with significant overhead - sometimes more than 50%.

 

This overhead is largely from memory allocation and page-commit. While this isn't a problem for small allocations, normal memory allocators do not attempt to pool memory when the sizes are on the order of gigabytes. Custom allocators can go a long way to solving this problem, but they still suffer from bookkeeping overhead and heap fragmentation.

 

To get around these problems, y-cruncher uses a form of "bare-metal" programming where there is no concept of RAII and all memory usage is micro-managed. This is the reason why y-cruncher's memory usage is flat throughput a computation. It computes exactly how much memory is needed, allocates it at the start of the computation, and frees it at the end.

 

Not surprisingly, this sort of bare-metal programming is very difficult and error-prone. But is by far the most efficient as it bypasses all overhead from the operating system and the memory allocation.

 

YMP exposes this bare-metal interface for users who wish to squeeze out the last 2x in performance.

 

 

Large Multiply Parameters:

 

As with nearly everything related to bignums, it all boils down to large multiplication...

A large multiply in y-cruncher/YMP currently requires 4 resources in addition to the input/output operands:

All of these except for the parallel framework need to be passed in manually. The parallel framework is maintained as a global.

This list of resources/parameters is not static and may change in the future. So to keep things manageable in the long run, these parameters are put together into a struct called, "BasicParameters".

 

Why this weird name? Because internally, there's a different one for swap mode/out-of-core operations called, "SwapParameters" - which currently has 6 parameters. Swap mode probably isn't coming to the public interface any time soon. So we won't open that can of worms.

 

 

What can you do with the Parameters?

 

Any function in y-cruncher/YMP that takes these parameters promises to do the entire operation without any resource acquisition except for those triggered indirectly by the parallel framework or from exception handling. Such functions are paired with "sizing" functions that give the minimum sizes of the scratch buffer(s).

 

This gives the user complete and absolute control over all memory usage. There are no uncertainties involving size and fragmentation.

 

The sizing functions can be used for pre-planning computations and estimating resource requirements with provable upper-bounds. This is how y-cruncher does all its memory calculations in the Custom Compute menu.

 


Basic Parameters Object:

struct BasicParameters{

    const LookupTable* tw;

    void* M;

    upL_t ML;

    upL_t tds;

};

 

Name Parameter Description
tw Twiddle Factor/Lookup Table

If the operation needs a lookup table, it will use this one rather than the global one.

{M, ML} Scratch Memory

M is a pointer to a scratch memory buffer.

The buffer must be at least ML bytes large.

 

Operations that need scratch memory will use this buffer. This buffer must be sufficiently large and be aligned to at least ALIGNMENT bytes. Failing either condition will result in undefined behavior.

 

Most operations will check their size requirement against ML and throw an exception if it is too small. But this behavior should not be relied upon. Some performance-critical operations will skip the check if it is too expensive to compute the size that is actually needed.

 

All functions that need scratch memory are paired with a sizing function that gives how large the buffer must be.

 

The ML parameter currently serves no other purpose than for optional buffer overrun detection. Setting it to -1 (largest unsigned integer) will have no effect on a correct program.

tds Task Decomposition

The desired level of task decomposition for the purpose of parallelism.

How it is actually parallelized is dependent on the underlying framework.

 

To disable parallelism, set this to 1. In most cases, set this to the number of logical cores in the system unless you plan on parallelizing at a higher level. Setting it to 0 results in undefined behavior.

 

This task decomposition parameter is a suggestion rather than a requirement. The actual level of decomposition may vary depending on numerous factors.

In most use-cases, the parameters object remains static. So it can be constructed once and reused many times. (But not in parallel due to the scratch buffer.)

 

 

In the future, more parameters may be added. As of this writing, the only ones under consideration are:

  1. Changing the parallel framework from a global to a manual parameter.
  2. Adding parameters to control NUMA.
  3. Adding parameters to specify cache hierarchy hints.

Some changes will likely break backwards compatiblity. But that's somewhat unavoidable when trying to interface with low-level internals.