News (2018)

 

Back To:

 


Computation of Pi using a Custom Formula

Custom Formulas: (October 28, 2018) - permalink

 

For those of you who have been paying attention to the record lists, you'll have noticed some new constants which y-cruncher has not supported before. Long story short, I've been experimenting with a custom formula feature similar to that of PiFast's User Constants. And at this point, it's in a good enough shape to announce.

 

y-cruncher's focus has always been to focus on a small number of popular constants and take them to the extreme. But it's been on the back of mind for years to add some sort of formula feature to allow people to use y-cruncher's capabilities to compute other things as well. I'm glad to say that this will finally be coming in v0.7.7.

 

The new formula feature will allow users to input arbitrary (finite sized) formulas with functions such as:

These encompass nearly all of y-cruncher's internal functionality. The AGM and large number square root are new to y-cruncher v0.7.7 and didn't exist before. This set of functions is not final and will probably grow a bit more before v0.7.7 is released.

 

A list of formulas can be found here: https://github.com/Mysticial/y-cruncher-Formulas/tree/master/Formulas

 

 

This custom formula feature may seem like it's coming out of nowhere. But it's been in the making for years. There's a long list of reasons why it took so long:

 

The custom formula feature is not yet complete. Given the scale of such a feature, it will require more than the usual amount of testing. So it's still months away.

 

 


Catalan's Constant is No Longer a Slow Constant: (August 23, 2018) - permalink

 

In what is probably the first major mathematical improvement since the start of the y-cruncher project, there are 3 new formulas for Catalan's Constant.

 

I was recently made aware of a publication by F. M. S. Lima which directly and indirectly led to 3 new formulas for Catalan's Constant which are faster than the Lupas formula which had been the fastest for more than a decade.

Name Formula Speedup vs. Lupas
Guillera 1.7x
Pilehrood (short) 3.8x
Pilehrood (long) 2.5x

All 3 formulas have been implemented for y-cruncher v0.7.7. But the timing of this with respect to y-cruncher's release cycle (having just released v0.7.6), means it will be months before these formulas see the light of day for y-cruncher. But at the very least, we know that they exist and that they work. All 3 implementations have been tested to 1 billion digits. (which didn't take very long)

 

Catalan's Constant has historically been one of the slowest mainstream constants to compute - second only to the Euler-Mascheroni Constant. But thanks to these formulas, Catalan's Constant is now almost as fast as Zeta(3).

 

As only two independent algorithms are needed to establish records, the existing formulas by Lupas and Huvent (as well as Guillera's formula) are now effectively out-of-date. But there are no immediate plans to remove any of the outdated formulas.

 

Mathematics evolves at a much slower pace than computer hardware. So it is easy to forget that math remains one of the most important parts of high precision computations. Without such formulas and algorithms, large computations would be impossible. So we must never stop thanking the generations of mathemeticians, both past and present, that have made such projects possible.

 

While we wait for v0.7.7, two new articles have been added to the algorithms page:

 

 


Version 0.7.6: (August 11, 2018) - permalink

 

A new version is out. This release is mostly a dump of bug fixes, internal refactorings, and secondary features. So not a whole lot on the outside.

 

The only major feature is the "18-CNL" binary for Cannon Lake processors. Due to the lack of suitable hardware from the delays in Intel's 10nm process, it will probably remain untuned and not fully tested until at least 2020. The Core i3 8121U systems that currently exist are too difficult to obtain and are likely "too small" to do the tuning that y-cruncher needs.

 

 

 


Spectre Attack via AVX Clock Speed Throttle: (August 7, 2018) - permalink

 

Somewhat off-topic, but not entirely unrelated since AVX is a common theme here.

 

Apparently, the AVX clock speed throttle present in Intel processors is yet another Spectre attack vector. This is something that I've been thinking about on-and-off since January, but the conceptual breakthrough didn't happen until June.

 

Unbeknownst to me, NetSpectre a closely related exploit, was already in the workings and was publicized in July while my thread with Intel security was still ongoing.

In the end, my communications with Intel were less productive than I had expected. So I'm now publishing the blog describing the attack. (Related Tweet)

 

The contents of the blog are now out-of-date in the context of NetSpectre. But I'll leave it as is - the way I intended to publish it back in June.

 

Feel free to reach out to me about this topic on Twitter or something. While I'm not a researcher in this area, I do have some other ideas for Spectre attacks and enhancements which I'm not really interested in pursuing. So it would be neat if someone else can turn them into working exploits.

 

 

 


AVX512 Scalability - One Year In: (July 2, 2018) - permalink

 

When AVX512 (for consumer hardware) launched a year ago with Skylake X, few applications supported it. y-cruncher was one of those few, but the performance was very disappointing due to all sorts of hardware issues and unexpected performance bottlenecks.

 

Over the past year, work was done to target those bottlenecks. And while they failed to eliminate them, they still improved performance by a lot. So now that there's been enough time to properly optimize and tune for AVX512, we can finally do a fair evaluation of this instruction set for bignum crunching.

 

The table below shows scalability across two dimensions: instruction set (AVX2 -> AVX512) and parallelism. This processor has 10 cores hyperthreaded to 20 threads. The clock speed is normalized to 3.6 GHz for all workloads. This is admittedly unrealistic since AVX512 will usually run at a lower frequency. But it does provide a perfect clock-for-clock comparison between AVX2 and AVX512.

1 Billion Digits of Pi - (Times in Seconds)

Core i9 7900X @ 3.6 GHz - 4 x DDR4 @ 3000 MT/s

  y-cruncher v0.7.3 (July 2017) y-cruncher v0.7.6 (ETA 2018)
  14-BDW (AVX2) 17-SKX (AVX512) Speedup 14-BDW (AVX2) 17-SKX (AVX512) Speedup
1 thread 473.317 331.888 1.43 x 439.967 272.042 1.62 x
20 threads 48.658 39.862 1.22 x 41.833 30.645 1.37 x
Speedup 9.73 x 8.33 x   10.50 x 8.88 x  

The single-threaded benchmarks show the raw speed of AVX2 vs. AVX512. While everything got faster from v0.7.3 to v0.7.6, the AVX512 improved more. This is because v0.7.6 has all the new optimizations that can only be done with real AVX512 hardware. In contrast, v0.7.3 was released only 2 weeks after Skylake X was launched. Much of the AVX512 in v0.7.3 was written years ago using only emulators and before the hardware ever existed in silicon.

 

The multi-threaded runs show smaller speedups from AVX2 to AVX512. This is due to memory bandwidth becoming a factor. Even with 4 channels of high-clocked memory, Skylake X with more than a few cores does not have enough memory bandwidth to feed all the cores. Many of the optimizations between v0.7.3 and v0.7.6 were targeted at alleviating this bottleneck.

 

 

So as of 2018, the raw AVX2 -> AVX512 speedup is 62% on Skylake X with dual 512-bit FMAs. Given the amount of code that remains unvectorized, 62% is reasonable due to Amdahl's Law. Furthermore, AVX512 only doubles up the floating-point capability. Many integer operations fall well short of that.

 

But once everything else is factored in (AVX512 clock speed throttle and memory bandwidth with multi-threading), the real speedup of AVX512 on a large Skylake X processor with a lot of cores dwindles to around 20 - 30%. Disappointing? Yes. But not unsurprising for new technology. Perhaps things will be better when DDR5 becomes a thing.

 

 

Cannon Lake?

 

Maybe it's too early to start talking about Cannon Lake since it will still be a long time before they hit the market in volume.

250 Million Digits of Pi - (Times in Seconds)

Core i3-8121U @ unknown fixed clock speed*

  y-cruncher v0.7.6 (ETA 2018)
  14-BDW (AVX2) 17-SKX (AVX512-DQ) 18-CNL (AVX512-VBMI)
1 thread 100.057 87.840 66.510

The Core i3-8121U is a 10nm Cannon Lake processor with only one 512-bit FMA. Having only one FMA severely limits the performance of the baseline AVX512. But the new binary seems to care less. It's way too early to draw any conclusions yet.

 

*Credit to Jzw for testing this.

 

 

 


Pi Day and "houkouonchi": (March 14, 2018) - permalink

 

For those who have been following the Pi computation world records, you'll know that "houkouonchi" is obviously a pseudonym. Back in 2014 when he set the Pi world record with 13.3 trillion digits, he asked me not to reveal his real name. His reason: He didn't want to be bothered by people contacting him through his facebook and personal email.

 

However, houkouonchi sadly passed away in 2015.

 

Being an internet contact, I didn't find out about it for almost a year. Furthermore, I had no contact information for his family members.

 

For the past 2 years, I've been torn on whether or not to reveal his real name. On one hand, he asked me not reveal his name. But on the other hand, I felt a strong desire to put his name on his world record. Without any contact information, I've been unable to reach out to his family. And nobody is watching his email as my messages have remained unanswered.

 

In the end, I decided that his original reason for being anonymous is no longer applicable. Therefore I will now put a name to the record of 13.3 trillion digits.

 

His name is Sandon Van Ness. Rest in peace my friend.

 

 

 


Page Table Isolation and Large Pages: (January 7, 2018) - permalink

 

If you follow tech news, you should be well aware of the Meltdown and Spectre side-channel attacks that affect nearly all processors with speculative execution. Furthermore, the patches come with performance penalties that range anywhere from negligible to a ridiculous 50% depending on the application and hardware.

 

Is y-cruncher affected? Yes. But it may be avoidable under certain circumstances.

 

 

Hardware:

The following table compares performance with and without KPTI for Meltdown. Unfortunately, no BIOS/microcode update for the Spectre patch was available to test. Given the age of the system, it seems unlikely that the manufacturer will provide such an update.

1 billion digits of Pi y-cruncher v0.7.4
  Normal Pages (4 KB) Large Pages (2 MB)
No Patches 107.448 106.803
Kernel Page Table Isolation (KPTI) 110.418 106.388

Notes:

y-cruncher spends very little time in the kernel. So based on that, one would expect the effect of KPTI to be negligible. However, there are a lot of small system calls from all the multi-threading related constructs. (mutexes, condition variables, signals, etc...)

 

In the end, we see a 3% performance impact when using normal (4 KB) pages. But when switching to large (2 MB) pages, that penalty disappears.

 

A possible explanation for this is that each system call that goes into kernel mode will cause a TLB flush upon its return. So even if the system call is short, it leads to a flood of TLB misses as the computation resumes and has to re-populate the TLB. Since y-cruncher has a massive memory footprint, there will be a lot of pages to re-populate. With large pages, there are far fewer of them - thereby drastically reducing the performance penalty. Though this explanation has issues since PCID should (theoretically) be eliminating the TLB flushes as far as I understand (which I admit I don't).

 

Regardless of the exact reason for why large pages help so much, let's not get too excited. This is just a single benchmark on a single platform. Things may look different on other systems. Furthermore, there are requirements to enable large pages - some of which may be inconvenient.

 

Those interested in testing out large pages for y-cruncher can refer to the memory allocation guide.

 

 

Looking forward, the current development version of v0.7.5 is showing significantly less penalty from KPTI:

1 billion digits of Pi y-cruncher v0.7.5 (trunk)
  Normal Pages (4 KB) Large Pages (2 MB)
No Patches 104.337 102.995
Kernel Page Table Isolation (KPTI) 104.581 103.142

It's unclear why this is the case. But it could be a side-effect of the new bandwidth optimizations.

 

Version v0.7.5 is currently not ready for release. However, it is in feature freeze.

 

 

Spectre Mitigations:

 

So far, I have yet to test the impact of the Spectre mitigations.