y-cruncher Multi-Precision Library |
||
A Multi-threaded Multi-Precision Arithmetic Library(Powered by y-cruncher) |
(Last updated: March 3, 2016)
Shortcuts:
YMP Library:
Large Number Objects:
Low-Level Programming:
Low-Level Functions
Low-Level functions are the most basic and fundamental operations in the y-cruncher project. These functions are designed to be fast and minimal. They provide little to no abstraction and are therefore hard to use.
These low-level functions map one-to-one with internal y-cruncher routines and are analagous to GMP's "mpn" layer.
The main difference is that GMP's mpn-layer is more feature rich whereas YMP's low-level layer is more focused on beefing up the large multiplication with features like parallelism and memory management.
Data Representation:
Low-Level functions operate directly on raw arrays of 32 or 64 bit integer words. These words represent "digits" in base 2^{32} or 2^{64} laid out in little-endian order.
The integer that is represented is as follows:
large integer = A[0] + A[1]*base^{1} + A[2]*base^{2} + A[3]*base^{3} + ...
By design, all low-level functions support both 32-bit and 64-bit words for all architectures. But the best word-size is usually the one matching the native word size.
General Notes:
Performance Notes:
Missing Functionality:
The functions documented here cover most of the low-level functionality in y-cruncher. The only major one that is missing is the transform-only interface which is not user-friendly enough to enable yet.
Functions such as division and radix conversion are missing because they are not part of y-cruncher's low-level functionality. These are done at a higher level since they use algorithms that are difficult to implement efficiently without object abstraction.
Function List:
Large Multiplication:
Add Carry and Propagate:
void add_ip (u32_t* T, u32_t cin);
void add_ip (u64_t* T, u64_t cin);
Description:
If cin is zero, do nothing.
If cin is not zero, add 1 to T and propagate as far as necessary. The caller must ensure that it will never propagate into unmapped memory.
Performance:
Add Carry, Propagate, and Return:
u32_t add_ir (u32_t* T, upL_t L, u32_t cin);
u64_t add_ir (u64_t* T, upL_t L, u64_t cin);
Description:
If cin is zero, do nothing.
If cin is not zero, add 1 to {T, L}. Return the carryout after propagating L words. The carryout will always be 0 or 1.
Performance:
In-place Add/Subtract and Propagate Carry:
void add_ip (u32_t* T, const u32_t* A, upL_t L, u32_t cin = 0);
void add_ip (u64_t* T, const u64_t* A, upL_t L, u64_t cin = 0);
void sub_ip (u32_t* T, const u32_t* A, upL_t L, u32_t cin = 0);
void sub_ip (u64_t* T, const u64_t* A, upL_t L, u64_t cin = 0);
Description:
If cin is zero:
{T, L} = {T, L} + {A, L} and propagate as far as needed.
{T, L} = {T, L} - {A, L} and propagate as far as needed.
otherwise:
{T, L} = {T, L} + {A, L} + 1 and propagate as far as needed.
{T, L} = {T, L} - {A, L} - 1 and propagate as far as needed.
Performance:
In-place Add/Subtract and Return Carry:
u32_t add_ir (u32_t* T, const u32_t* A, upL_t L, u32_t cin = 0);
u64_t add_ir (u64_t* T, const u64_t* A, upL_t L, u64_t cin = 0);
u32_t sub_ir (u32_t* T, const u32_t* A, upL_t L, u32_t cin = 0);
u64_t sub_ir (u64_t* T, const u64_t* A, upL_t L, u64_t cin = 0);
Description:
If cin is zero:
{T, L} = {T, L} + {A, L} and return carryout.
{T, L} = {T, L} - {A, L} and return carryout.
otherwise:
{T, L} = {T, L} + {A, L} + 1 and return carryout.
{T, L} = {T, L} - {A, L} - 1 and return carryout.
Performance:
Out-of-place Add/Subtract and Return Carry:
u32_t add_nr (u32_t* T, const u32_t* A, const u32_t* B, upL_t L, u32_t cin = 0);
u64_t add_nr (u64_t* T, const u64_t* A, const u64_t* B, upL_t L, u64_t cin = 0);
u32_t sub_nr (u32_t* T, const u32_t* A, const u32_t* B, upL_t L, u32_t cin = 0);
u64_t sub_nr (u64_t* T, const u64_t* A, const u64_t* B, upL_t L, u64_t cin = 0);
Description:
If cin is zero:
{T, L} = {A, L} + {B, L} and return carryout.
{T, L} = {A, L} - {B, L} and return carryout.
otherwise:
{T, L} = {A, L} + {B, L} + 1 and return carryout.
{T, L} = {A, L} - {B, L} - 1 and return carryout.
Performance:
In-place Single-Word Multiply:
u32_t mul_ir (u32_t* T, upL_t L, u32_t B, u32_t cin = 0);
u64_t mul_ir (u64_t* T, upL_t L, u64_t B, u64_t cin = 0);
Description:
{T, L} = {T, L} * B + cin
Returns the carryout word.
Performance:
Out-of-place Single-Word Multiply:
u32_t mul_nr (u32_t* T, const u32_t* A, upL_t L, u32_t B, u32_t cin = 0);
u64_t mul_nr (u64_t* T, const u64_t* A, upL_t L, u64_t B, u64_t cin = 0);
Description:
{T, L} = {A, L} * B + cin
Returns the carryout word.
Performance:
Basecase Multiply:
void sqrB (u32_t* T, const u32_t* A, upL_t AL);
void sqrB (u64_t* T, const u64_t* A, upL_t AL);
void mulB (u32_t* T, const u32_t* A, upL_t AL, const u32_t* B, upL_t BL);
void mulB (u64_t* T, const u64_t* A, upL_t AL, const u64_t* B, upL_t BL);
Description:
Squares or multiplies two numbers using the Basecase multiplication algorithm.
{T, 2*AL} = {A, AL}^{2}
{T, AL + BL} = {A, AL} * {B, BL}
These functions are low-overhead functions that run in O(N^{2}) time. Use these for small inputs only.
Performance:
Notes:
Calling the square function may or may not be faster than calling multiply with the same operand.
Full Product Multiply:
void sqr (const BasicParameters& mp, u32_t* T, const u32_t* A, upL_t AL);
void sqr (const BasicParameters& mp, u64_t* T, const u64_t* A, upL_t AL);
void mul (const BasicParameters& mp, u32_t* T, const u32_t* A, upL_t AL, const u32_t* B, upL_t BL);
void mul (const BasicParameters& mp, u64_t* T, const u64_t* A, upL_t AL, const u64_t* B, upL_t BL);
Description:
Squares or multiplies two numbers.
{T, 2*AL} = {A, AL}^{2}
{T, AL + BL} = {A, AL} * {B, BL}
These functions run in quasi-linear time.
Performance:
BasicParameter (mp) Required Fields:
Scratch Memory Buffer Size (for full products):
uiL_t mul_Psize<u32_t> (uiL_t cwordlen, upL_t tds = 1);
uiL_t mul_Psize<u64_t> (uiL_t cwordlen, upL_t tds = 1);
Description:
Returns the size of the scratch buffer needed to call sqr() and mul() with a product length equal to cwordlen.
Performance:
Notes:
The size returned by this function is only valid for a multiplication using exactly the same size parameters that were passed in.
This is because the value returned by this function is not an increasing function to the cwordlen parameter.
Calling the function with a smaller size may result in a larger memory requirement because the memory requirements of the large multiplication is not an increasing function. Smaller multiplications may require more memory than larger ones.
In other words, do not use this function to get an upper-bound on memory usage for a specific maximum size. For that use mul_iPsize().
Partial Product Multiply:
void sqrp (const BasicParameters& mp, u32_t* T, upL_t s, upL_t e, const u32_t* A, upL_t AL);
void sqrp (const BasicParameters& mp, u64_t* T, upL_t s, upL_t e, const u64_t* A, upL_t AL);
void mulp (const BasicParameters& mp, u32_t* T, upL_t s, upL_t e, const u32_t* A, upL_t AL, const u32_t* B, upL_t BL);
void mulp (const BasicParameters& mp, u64_t* T, upL_t s, upL_t e, const u64_t* A, upL_t AL, const u64_t* B, upL_t BL);
Description:
Square or multiply two numbers to get a full product:
{U, 2*AL} = {A, AL}^{2}
{U, AL + BL} = {A, AL} * {B, BL}
{T, e - s} = SubRegion({U, AL + BL})
Then store a truncated sub-region [s, e) of this full product into T. Therefore, the output T must have a length of e - s.
For large inputs, this may be faster than doing a full product and copying out the sub-region.
The primary use for this function is to efficiently perform floating-point multiplication.
Notes:
- When s = 0 and e = AL + BL, these functions behave identically to the full product functions.
- When s ≠ 0, the result may be inexact. The lowest bit may be off by +/- 1. This error of 1 bit may carry over into higher bits. If it carries all the way past the end index e, it is dropped. Carryout from round-off will never write past end index e.
- If the correct full product is zero, the error is always zero.
Examples of possible round-off error:
- Correct output: trunc(A * B) = 1234, Actual output: 1235
- Correct output: trunc(A * B) = 1234, Actual output: 1233
- Correct output: trunc(A * B) = 9999, Actual output: 0000
- Correct output: trunc(A * B) = 0000, Actual output: 9999
This error depends on numerous factors and is not consistent between different processors or even the amount of task decomposition for the same input. Nevertheless, it is still deterministic.
Conditions:
Performance:
BasicParameter (mp) Required Fields:
Notes:
The error can be caused by a number of different sources. The most common is premature truncation of the output. When portions of the product are not needed, the implementation will avoid computing them in the first place. While this can improve performance, it means that any carryout that could potentially propagate into the desired portions of the product may be lost.
Because of the nature of the algorithms, this lost carryout may be positive or negative. If an exact result is needed, the only option is to do a full product multiply and manually truncate the result.
When s > 0 and e < AL + BL, it may be possible to use the Middle Product optimization. But YMP currently does not do this yet.
Scratch Memory Buffer Size (for partial products):
uiL_t mulp_Psize<u32_t> (uiL_t s, uiL_t e, uiL_t AL, uiL_t BL, upL_t tds = 1);
uiL_t mulp_Psize<u64_t> (uiL_t s, uiL_t e, uiL_t AL, uiL_t BL, upL_t tds = 1);
Description:
Returns the size of the scratch buffer needed to call sqrp() and mulp() with the exact same parameters.
Performance:
Notes:
The size returned by this function is only valid for a multiplication using exactly the same size parameters that were passed in.
This is because the value returned by this function is not an increasing function to the length parameters and certainly not to the truncation parameters.
Calling the function with a smaller size may result in a larger memory requirement because the memory requirements of the large multiplication is not an increasing function. Smaller multiplications may require more memory than larger ones.
In other words, do not use this function to get an upper-bound on memory usage for a specific maximum size. For that use mul_iPsize().
Incremental Scratch Memory Buffer Size (for all products):
uiL_t mul_iPsize<u32_t> (uiL_t cwordlen, upL_t tds = 1);
uiL_t mul_iPsize<u64_t> (uiL_t cwordlen, upL_t tds = 1);
Description:
Returns the size of the scratch buffer needed to call sqr(), mul(), sqrp(), or mulp() with any product length equal to or less than cwordlen and for any task decomposition equal to or less than tds. For the partial products, it will work on any combination of parameters s and e.
The use case of this function is to get an upper-bound for all multiplications so that a single scratch buffer can be allocated to handle any size up the maximum size and task decomposition.
The size returned by this function also works for multiplications done using the transform-only interface which is not yet public.
Performance:
Notes:
This is the most powerful of the Psize functions. But it is also the most expensive. It works by iterating through all multiplication algorithms and walking their parameter tables. So it's best not to abuse this function in performance critical places.