y-cruncher Multi-Precision Library

A Multi-threaded Multi-Precision Arithmetic Library

(Powered by y-cruncher)

(Last updated: March 3, 2016)

Shortcuts:

Numberworld Home

YMP Library:

Large Number Objects:

Low-Level Programming:

Low-Level Functions

Low-Level functions are the most basic and fundamental operations in the y-cruncher project. These functions are designed to be fast and minimal. They provide little to no abstraction and are therefore hard to use.

These low-level functions map one-to-one with internal y-cruncher routines and are analagous to GMP's "mpn" layer.

The main difference is that GMP's mpn-layer is more feature rich whereas YMP's low-level layer is more focused on beefing up the large multiplication with features like parallelism and memory management.

Data Representation:

Low-Level functions operate directly on raw arrays of 32 or 64 bit integer words. These words represent "digits" in base 2³² or 2⁶⁴ laid out in little-endian order.

The integer that is represented is as follows:

large integer = A[0] + A[1]*base¹ + A[2]*base² + A[3]*base³ + ...

By design, all low-level functions support both 32-bit and 64-bit words for all architectures. But the best word-size is usually the one matching the native word size.

General Notes:

Unless otherwise stated, the word ordering is little-endian. Each pointer to a bignum points to the least significant word of the number. Words of increasing magnitude are at increasing indexes. Carry-propagation goes in the direction of increasing address.

Unless otherwise stated, all length parameters must be greater than zero. A length of zero will result in undefined behavior.

Unless otherwise stated, all non-const pointer parameters are implicitly restrict. In other words, aliasing/overlapping between pointer ranges is not allowed unless the pointers are const. Input and output pointers cannot be aliased.

Unless otherwise stated, all functions are thread-safe provided that all operands are not modified concurrently. Also note that concurrent reads to output operands will still result in undefined behavior in the concurrent thread.

Unless otherwise stated, all functions do no memory allocation except for those indirectly caused by multi-threading.

Non-critical errors are passed up via C++ exceptions. Critical errors will exit the application.

The function definitions here may not be exactly as they are shown. They may be macros or templates that imitate the documented function. So use them normally and avoid doing crazy things like making function pointers of them.

Performance Notes:

The large multiply functions that take a BasicParameters parameter have significant calling overhead. So they may not be great for very small products.
This overhead comes from algorithm selection and exception handling across the DLL boundary.

All the large multiply functions come in two flavors: square and multiply. Squaring is generally faster than multiplying with the same operand.

Missing Functionality:

The functions documented here cover most of the low-level functionality in y-cruncher. The only major one that is missing is the transform-only interface which is not user-friendly enough to enable yet.

Functions such as division and radix conversion are missing because they are not part of y-cruncher's low-level functionality. These are done at a higher level since they use algorithms that are difficult to implement efficiently without object abstraction.

Function List:

Headers: ymp/LowLevel.h
Sources: none
Namespace: ymp::LowLevel

Carry Propagation:

add_ip()
add_ir()

Addition/Subtraction:

add_ip()
sub_ip()
add_ir()
sub_ir()
add_nr()
sub_nr()

Single-Word Multiply:

mul_ir()
mul_nr()

Large Multiplication:

Basecase Multiply:
- sqrB()
- mulB()
Full Product Multiply:
- sqr()
- mul()
- mul_Psize()
Partial Product Multiply:
- sqrp()
- mulp()
- mulp_Psize()
Incremental Scratch Buffer Size:
- mul_iPsize()

Carry Propagation:

Add Carry and Propagate:

void add_ip (u32_t* T, u32_t cin);

void add_ip (u64_t* T, u64_t cin);

Description:

If cin is zero, do nothing.

If cin is not zero, add 1 to T and propagate as far as necessary. The caller must ensure that it will never propagate into unmapped memory.

Performance:

Best Case: O(1)
Average Case: O(1)
Worst Case: O(N) where N is the number of words that it propagates.

Add Carry, Propagate, and Return:

u32_t add_ir (u32_t* T, upL_t L, u32_t cin);

u64_t add_ir (u64_t* T, upL_t L, u64_t cin);

Description:

If cin is zero, do nothing.

If cin is not zero, add 1 to {T, L}. Return the carryout after propagating L words. The carryout will always be 0 or 1.

Performance:

Best Case: O(1)
Average Case: O(1)
Worst Case: O(L)

Addition/Subtraction:

In-place Add/Subtract and Propagate Carry:

void add_ip (u32_t* T, const u32_t* A, upL_t L, u32_t cin = 0);

void add_ip (u64_t* T, const u64_t* A, upL_t L, u64_t cin = 0);

void sub_ip (u32_t* T, const u32_t* A, upL_t L, u32_t cin = 0);

void sub_ip (u64_t* T, const u64_t* A, upL_t L, u64_t cin = 0);

Description:

If cin is zero:

{T, L} = {T, L} + {A, L} and propagate as far as needed.

{T, L} = {T, L} - {A, L} and propagate as far as needed.

otherwise:

{T, L} = {T, L} + {A, L} + 1 and propagate as far as needed.

{T, L} = {T, L} - {A, L} - 1 and propagate as far as needed.

Performance:

All Cases: O(L + N) where N is the number of words that it propagates.

In-place Add/Subtract and Return Carry:

u32_t add_ir (u32_t* T, const u32_t* A, upL_t L, u32_t cin = 0);

u64_t add_ir (u64_t* T, const u64_t* A, upL_t L, u64_t cin = 0);

u32_t sub_ir (u32_t* T, const u32_t* A, upL_t L, u32_t cin = 0);

u64_t sub_ir (u64_t* T, const u64_t* A, upL_t L, u64_t cin = 0);

Description:

If cin is zero:

{T, L} = {T, L} + {A, L} and return carryout.

{T, L} = {T, L} - {A, L} and return carryout.

otherwise:

{T, L} = {T, L} + {A, L} + 1 and return carryout.

{T, L} = {T, L} - {A, L} - 1 and return carryout.

Performance:

All Cases: O(L)

Out-of-place Add/Subtract and Return Carry:

u32_t add_nr (u32_t* T, const u32_t* A, const u32_t* B, upL_t L, u32_t cin = 0);

u64_t add_nr (u64_t* T, const u64_t* A, const u64_t* B, upL_t L, u64_t cin = 0);

u32_t sub_nr (u32_t* T, const u32_t* A, const u32_t* B, upL_t L, u32_t cin = 0);

u64_t sub_nr (u64_t* T, const u64_t* A, const u64_t* B, upL_t L, u64_t cin = 0);

Description:

If cin is zero:

{T, L} = {A, L} + {B, L} and return carryout.

{T, L} = {A, L} - {B, L} and return carryout.

otherwise:

{T, L} = {A, L} + {B, L} + 1 and return carryout.

{T, L} = {A, L} - {B, L} - 1 and return carryout.

Performance:

All Cases: O(L)

Single-Word Multiply:

In-place Single-Word Multiply:

u32_t mul_ir (u32_t* T, upL_t L, u32_t B, u32_t cin = 0);

u64_t mul_ir (u64_t* T, upL_t L, u64_t B, u64_t cin = 0);

Description:

{T, L} = {T, L} * B + cin

Returns the carryout word.

Performance:

All Cases: O(L)

Out-of-place Single-Word Multiply:

u32_t mul_nr (u32_t* T, const u32_t* A, upL_t L, u32_t B, u32_t cin = 0);

u64_t mul_nr (u64_t* T, const u64_t* A, upL_t L, u64_t B, u64_t cin = 0);

Description:

{T, L} = {A, L} * B + cin

Returns the carryout word.

Performance:

All Cases: O(L)

Basecase Multiplication:

Basecase Multiply:

void sqrB (u32_t* T, const u32_t* A, upL_t AL);

void sqrB (u64_t* T, const u64_t* A, upL_t AL);

void mulB (u32_t* T, const u32_t* A, upL_t AL, const u32_t* B, upL_t BL);

void mulB (u64_t* T, const u64_t* A, upL_t AL, const u64_t* B, upL_t BL);

Description:

Squares or multiplies two numbers using the Basecase multiplication algorithm.

{T, 2*AL} = {A, AL}²

{T, AL + BL} = {A, AL} * {B, BL}

These functions are low-overhead functions that run in O(N²) time. Use these for small inputs only.

Performance:

All Cases: O(AL * BL)

Notes:

Calling the square function may or may not be faster than calling multiply with the same operand.

Full Product Multiply:

Full Product Multiply:

void sqr (const BasicParameters& mp, u32_t* T, const u32_t* A, upL_t AL);

void sqr (const BasicParameters& mp, u64_t* T, const u64_t* A, upL_t AL);

void mul (const BasicParameters& mp, u32_t* T, const u32_t* A, upL_t AL, const u32_t* B, upL_t BL);

void mul (const BasicParameters& mp, u64_t* T, const u64_t* A, upL_t AL, const u64_t* B, upL_t BL);

Description:

Squares or multiplies two numbers.

{T, 2*AL} = {A, AL}²

{T, AL + BL} = {A, AL} * {B, BL}

These functions run in quasi-linear time.

Performance:

All Cases: O( N * log(N) ) where N = AL + BL

BasicParameter (mp) Required Fields:

mp.tw
mp.M
mp.ML
mp.tds

Scratch Memory Buffer Size (for full products):

uiL_t mul_Psize<u32_t> (uiL_t cwordlen, upL_t tds = 1);

uiL_t mul_Psize<u64_t> (uiL_t cwordlen, upL_t tds = 1);

Description:

Returns the size of the scratch buffer needed to call sqr() and mul() with a product length equal to cwordlen.

Performance:

All Cases: O( loglog(AL + BL) )

Notes:

The size returned by this function is only valid for a multiplication using exactly the same size parameters that were passed in.

This is because the value returned by this function is not an increasing function to the cwordlen parameter.

Calling the function with a smaller size may result in a larger memory requirement because the memory requirements of the large multiplication is not an increasing function. Smaller multiplications may require more memory than larger ones.

In other words, do not use this function to get an upper-bound on memory usage for a specific maximum size. For that use mul_iPsize().

Partial Product Multiply:

Partial Product Multiply:

void sqrp (const BasicParameters& mp, u32_t* T, upL_t s, upL_t e, const u32_t* A, upL_t AL);

void sqrp (const BasicParameters& mp, u64_t* T, upL_t s, upL_t e, const u64_t* A, upL_t AL);

void mulp (const BasicParameters& mp, u32_t* T, upL_t s, upL_t e, const u32_t* A, upL_t AL, const u32_t* B, upL_t BL);

void mulp (const BasicParameters& mp, u64_t* T, upL_t s, upL_t e, const u64_t* A, upL_t AL, const u64_t* B, upL_t BL);

Description:

Square or multiply two numbers to get a full product:

{U, 2*AL} = {A, AL}²

{U, AL + BL} = {A, AL} * {B, BL}

{T, e - s} = SubRegion({U, AL + BL})

Then store a truncated sub-region [s, e) of this full product into T. Therefore, the output T must have a length of e - s.

For large inputs, this may be faster than doing a full product and copying out the sub-region.

The primary use for this function is to efficiently perform floating-point multiplication.

Notes:

When s = 0 and e = AL + BL, these functions behave identically to the full product functions.

When s ≠ 0, the result may be inexact. The lowest bit may be off by +/- 1. This error of 1 bit may carry over into higher bits. If it carries all the way past the end index e, it is dropped. Carryout from round-off will never write past end index e.

If the correct full product is zero, the error is always zero.

Examples of possible round-off error:

Correct output: trunc(A * B) = 1234, Actual output: 1235

Correct output: trunc(A * B) = 1234, Actual output: 1233

Correct output: trunc(A * B) = 9999, Actual output: 0000

Correct output: trunc(A * B) = 0000, Actual output: 9999

This error depends on numerous factors and is not consistent between different processors or even the amount of task decomposition for the same input. Nevertheless, it is still deterministic.

Conditions:

0 <= s < e <= AL + BL
The lookup table mp.tw needs to be sufficiently large.
The scratch buffer mp.{M, ML} needs to be sufficiently large and sufficiently aligned.

Performance:

Best Case: O(1)
Average/Worst Case: O( N * log(N) ) where N = AL + BL

BasicParameter (mp) Required Fields:

mp.tw
mp.M
mp.ML
mp.tds

Notes:

The error can be caused by a number of different sources. The most common is premature truncation of the output. When portions of the product are not needed, the implementation will avoid computing them in the first place. While this can improve performance, it means that any carryout that could potentially propagate into the desired portions of the product may be lost.

Because of the nature of the algorithms, this lost carryout may be positive or negative. If an exact result is needed, the only option is to do a full product multiply and manually truncate the result.

When s > 0 and e < AL + BL, it may be possible to use the Middle Product optimization. But YMP currently does not do this yet.

Scratch Memory Buffer Size (for partial products):

uiL_t mulp_Psize<u32_t> (uiL_t s, uiL_t e, uiL_t AL, uiL_t BL, upL_t tds = 1);

uiL_t mulp_Psize<u64_t> (uiL_t s, uiL_t e, uiL_t AL, uiL_t BL, upL_t tds = 1);

Description:

Returns the size of the scratch buffer needed to call sqrp() and mulp() with the exact same parameters.

Performance:

Best Case: O(1)
Average/Worst Case: O( loglog(AL + BL) )

Notes:

The size returned by this function is only valid for a multiplication using exactly the same size parameters that were passed in.

This is because the value returned by this function is not an increasing function to the length parameters and certainly not to the truncation parameters.

Calling the function with a smaller size may result in a larger memory requirement because the memory requirements of the large multiplication is not an increasing function. Smaller multiplications may require more memory than larger ones.

In other words, do not use this function to get an upper-bound on memory usage for a specific maximum size. For that use mul_iPsize().

Incremental Scratch Buffer:

Incremental Scratch Memory Buffer Size (for all products):

uiL_t mul_iPsize<u32_t> (uiL_t cwordlen, upL_t tds = 1);

uiL_t mul_iPsize<u64_t> (uiL_t cwordlen, upL_t tds = 1);

Description:

Returns the size of the scratch buffer needed to call sqr(), mul(), sqrp(), or mulp() with any product length equal to or less than cwordlen and for any task decomposition equal to or less than tds. For the partial products, it will work on any combination of parameters s and e.

The use case of this function is to get an upper-bound for all multiplications so that a single scratch buffer can be allocated to handle any size up the maximum size and task decomposition.

The size returned by this function also works for multiplications done using the transform-only interface which is not yet public.

Performance:

All Cases: O( log(cwordlen) )

Notes:

This is the most powerful of the Psize functions. But it is also the most expensive. It works by iterating through all multiplication algorithms and walking their parameter tables. So it's best not to abuse this function in performance critical places.

y-cruncher Multi-Precision Library A Multi-threaded Multi-Precision Arithmetic Library (Powered by y-cruncher)

y-cruncher Multi-Precision Library

A Multi-threaded Multi-Precision Arithmetic Library

(Powered by y-cruncher)

y-cruncher Multi-Precision Library

A Multi-threaded Multi-Precision Arithmetic Library

(Powered by y-cruncher)