Inverse Square Root

By Alexander Yee

 

(Last updated: February 5, 2024)

 

Back To:

 

y-cruncher uses Newton's Method iteration for performing the Inverse Square Root.

 

This operation is needed for the following constants:

 

Algorithm

 

Find:

Start with an initial guess:

Iterate:

Since this is a 1st order iteration, the number of correct digits doubles with each iteration.

 

There are several ways to arrange the formula. This particular arrangement was chosen because rn2 x - 1 has destructive cancellation of roughly half the digits. This cancellation reduces the size of the final multiply by rn.

 

 

Implementation

 

In y-cruncher, the initial "guess" is obtained by simply calling std::sqrt() in <cmath> which gives 53 bits.

 

At each stage, y-cruncher's implementation will choose either of the following iterations:

Which it chooses depends on whether rn turns out to be an under-estimate or an over-estimate of the correct answer. If it is an under-estimate, it uses the second formula to avoid negative numbers. Over-estimates are faster because subtraction by 1 is free whereas subtracting from 1 requires inversion of all the bits.

Therefore, y-cruncher's implementation will purturb the result of each iteration slightly to increase the chances that it will be an overestimate for the next iteration.

 

Lastly, there are two multiplications by rn.

Since the multiplies are the same size, y-cruncher reuses the forward FFT for rn.

 

Complexity Analysis

 

Complexity (Asymptotic):

 

Inverse Square Root: O( n log(n) )

 

The complexity of each iteration is O( M(n) ). Since Newton's Method is a self-correcting algorithm, full precision is not needed thoughout the iteration. Therefore the sum of all the iterations is a geometric sequence resulting in the same complexity: O( M(n) ).

 

If we assume that multiplication is M(n) = n log(n), then the entire operation is O( n log(n) ).

 

 

 

Complexity (Practical):

 

y-cruncher's inverse square root for small x is currently:

 

Memory: 1.33 M(n)

Swap: 1.50 M(n)

 

Each iteration contains:

One forward FFT is reused for a total of 2 forward and 2 inverse FFTs. y-cruncher also has an implementation for large x, but since those aren't needed for Pi, it's poorly optimized and won't be covered here.

 

For memory computations, we can assume that forward and inverse FFTs are of equal cost. Therefore a multiplication is equal to 3 FFTs.

For swap computations, the inverse FFT is about 2x the cost of a forward FFT. Therefore a multiplication is approximately 4 forward FFTs.

Operation Size Quantity Total Cost Total Cost (Memory) Total Cost (Swap)
Forward FFT 0.5 p 2 2 * F(0.5 p) 0.33 * M(p) 0.25 * M(p)
Inverse FFT 0.5 p 2 2 * I(0.5 p) 0.33 * M(p) 0.50 * M(p)
Add/Subtract 1.0 p 1 - - -
Divide by 2 1.0 p 1 - - -
Total       0.66 * M(p) 0.75 * M(p)

Summing these up through a binary geometric sequence gives exactly twice these numbers.