Inverse Square Root

By Alexander Yee

(Last updated: February 5, 2024)

Back To:

y-cruncher uses Newton's Method iteration for performing the Inverse Square Root.

This operation is needed for the following constants:

Square Root of x
Golden Ratio
Pi - Chudnovsky/Ramanujan

Algorithm

Find:

Start with an initial guess:

Iterate:

Since this is a 1st order iteration, the number of correct digits doubles with each iteration.

There are several ways to arrange the formula. This particular arrangement was chosen because r_n² x - 1 has destructive cancellation of roughly half the digits. This cancellation reduces the size of the final multiply by r_n.

Implementation

In y-cruncher, the initial "guess" is obtained by simply calling std::sqrt() in <cmath> which gives 53 bits.

At each stage, y-cruncher's implementation will choose either of the following iterations:

Which it chooses depends on whether r_n turns out to be an under-estimate or an over-estimate of the correct answer. If it is an under-estimate, it uses the second formula to avoid negative numbers. Over-estimates are faster because subtraction by 1 is free whereas subtracting from 1 requires inversion of all the bits.

Therefore, y-cruncher's implementation will purturb the result of each iteration slightly to increase the chances that it will be an overestimate for the next iteration.

Lastly, there are two multiplications by r_n.

r_n²
r_n * (...)

Since the multiplies are the same size, y-cruncher reuses the forward FFT for r_n.

Complexity Analysis

Complexity (Asymptotic):

Inverse Square Root: O( n log(n) )

The complexity of each iteration is O( M(n) ). Since Newton's Method is a self-correcting algorithm, full precision is not needed thoughout the iteration. Therefore the sum of all the iterations is a geometric sequence resulting in the same complexity: O( M(n) ).

If we assume that multiplication is M(n) = n log(n), then the entire operation is O( n log(n) ).

Complexity (Practical):

y-cruncher's inverse square root for small x is currently:

Memory: 1.33 M(n)

Swap: 1.50 M(n)

Each iteration contains:

1 Half-precision square
1 Half-precision multiply
2 Add/Subtracts (negligible)
1 Divide-by-two (negligible)

One forward FFT is reused for a total of 2 forward and 2 inverse FFTs. y-cruncher also has an implementation for large x, but since those aren't needed for Pi, it's poorly optimized and won't be covered here.

For memory computations, we can assume that forward and inverse FFTs are of equal cost. Therefore a multiplication is equal to 3 FFTs.

For swap computations, the inverse FFT is about 2x the cost of a forward FFT. Therefore a multiplication is approximately 4 forward FFTs.

Operation	Size	Quantity	Total Cost	Total Cost (Memory)	Total Cost (Swap)
Forward FFT	0.5 p	2	2 * F(0.5 p)	0.33 * M(p)	0.25 * M(p)
Inverse FFT	0.5 p	2	2 * I(0.5 p)	0.33 * M(p)	0.50 * M(p)
Add/Subtract	1.0 p	1	-	-	-
Divide by 2	1.0 p	1	-	-	-
Total				0.66 * M(p)	0.75 * M(p)

Summing these up through a binary geometric sequence gives exactly twice these numbers.