Version 0.8.2 Released: (September 4, 2023)
Part 2 of the revamp is now complete! While part 1 (v0.8.1) rewrote the large multiplication for in-memory computations, this release finishes the job by extending it to swap computations as well.
Because this release completely replaces the old disk multiplication, it is strongly recommended to test things at scale if you plan on doing any large computations (such as a Pi record). I have not personally tested anything above 1 trillion digits.
So with this release, the revamp is effectively complete:
While I had plans to also rewrite the floating-point FFTs, those are being shelved for later as the cost/benefit doesn't meet the bar against other higher priority tasks.
So as of this release (v0.8.2), y-cruncher peaks at 729,000 lines of code. Now the cleanup begins. The 5 algorithms (all except VST are already disabled in v0.8.2) will be removed from the codebase for a reduction of at least 240,000 lines of code. Meanwhile, the N63 and VT3 algorithms have added 133,000 lines of code.
Thus the net change of the entire revamp will be a reduction of roughly 100,000 lines of code with a total of just under 400,000 lines touched - which is close to what I had initially predicted.
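As a rough sanity check, the line-count figures quoted above can be worked through directly. This is only a back-of-the-envelope sketch using the rounded numbers from the text. The variable names are illustrative, and it assumes that "lines touched" means lines removed plus lines added.

```python
# Rounded line counts quoted in the text for the v0.8.x revamp.
removed_lines = 240_000   # lower bound: code for the 5 obsolete algorithms
added_lines = 133_000     # code added for the N63 and VT3 algorithms

# Net change: what gets removed minus what was added.
net_reduction = removed_lines - added_lines

# Assumption: "touched" counts both the removed and the added lines.
lines_touched = removed_lines + added_lines

print(f"net reduction: {net_reduction:,} lines")
print(f"lines touched: {lines_touched:,} lines")
```

Both results line up with the rounded claims in the text: a net reduction of roughly 100,000 lines and just under 400,000 lines touched.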
Part 3 of the revamp is just cleanup and will involve removing the code for the 5 obsolete algorithms. Since this involves a removal of features with nothing added in return, it will not get its own release and will simply be rolled into the next set of improvements.
The VST algorithm, despite being popular for stress-testing, will not be spared as it has dependencies on much of the 240,000 lines that will be removed. So it will be removed in v0.8.3 whenever that might be. The overclocking community has already shown that VT3 is the stronger stress-test so this should not be a huge loss.
As for performance changes, don't expect the same massive speedups for swap mode that v0.8.1 brought to in-memory computations. Swap mode computations were (and still are) disk-bound. So the computational improvements which gave v0.8.1 the large speedups will not translate proportionally to computations on disk.
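To illustrate why a compute speedup does not carry over to a disk-bound run, here is a deliberately simplified toy model. It is not how y-cruncher schedules work: it assumes compute and disk I/O overlap perfectly, so the slower of the two sets the wall time. The function names and all of the timings are hypothetical.

```python
def wall_time(compute_s: float, disk_s: float) -> float:
    """Toy model: compute and disk I/O fully overlap, so the slower one dominates."""
    return max(compute_s, disk_s)


def overall_speedup(compute_s: float, disk_s: float, compute_speedup: float) -> float:
    """Overall speedup when only the compute portion gets faster."""
    before = wall_time(compute_s, disk_s)
    after = wall_time(compute_s / compute_speedup, disk_s)
    return before / after


# Hypothetical workloads (seconds), both given the same 1.25x compute speedup.
in_memory = overall_speedup(compute_s=100.0, disk_s=0.0, compute_speedup=1.25)
disk_bound = overall_speedup(compute_s=100.0, disk_s=300.0, compute_speedup=1.25)

print(f"in-memory run:  {in_memory:.2f}x")
print(f"disk-bound run: {disk_bound:.2f}x")
```

In this idealized model the disk-bound run gains nothing at all. A real swap-mode computation sits somewhere between the two extremes, since compute and I/O rarely overlap perfectly.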
Material improvements to swap mode are slated for the future. While this release lays down much of the groundwork for future improvements, a lot more work and research is needed to get there.
What's next?
The cleanup and removal of the old algorithms is already done on the main development branch. So with the project at a good stopping point, I'll be taking a break from any major developments. I'm also not in the mood to do anything since I recently lost a very close member of the family.
Regardless, I intend to continue doing new binaries for whatever new and interesting processors I can get my hands on. Just don't expect any big changes (like the v0.8.1 improvements) for a while.
In memory of my uncle Robbie whom I was extremely close to and was effectively my 3rd parent growing up. Rest in peace. You'll be missed dearly. I will drive your Tesla someday, though it might be a while.
Version 0.8.1 Released: (July 11, 2023)
And it's finally here! Part one of the revamp is now complete. This release brings forward the newly rewritten algorithms which will have the most performance impact for in-memory computations.
Here are some benchmarks showing the improvements brought by v0.8.1 and AVX512. Because of the large performance swings, HWBOT integration will be withheld until the HWBOT community decides what to do.
Last year when I did the Zen 4 optimizations, I was disappointed (but not surprised) that I was only able to gain 1-2% speedup with AVX512. In fact, this was so embarrassingly bad that I couldn't publish any numbers. Sure, Zen 4's AVX512 is "double-pumped" and doesn't have wider units. But there's a lot more to AVX512 than just the 512-bit width.
In reality, I was able to achieve around 10% speedup for AVX512 on Zen 4 - but only within cache. Upon scaling it up, it was completely wiped out by the memory inefficiencies in the old algorithm. And it certainly didn't help that Zen 4 set a new record for insufficient memory bandwidth.
This memory bottleneck I suspect is the primary reason why the overall benefit of AVX512 remains higher on Intel than AMD even in v0.8.1. y-cruncher has been memory-bound on every high-end chip since 2017 with AMD faring worse due to having twice as many cores and lower memory speeds. While it's also tempting to blame Zen 4's "double-pumped" AVX512 as part of the problem, in reality it isn't much worse than Intel chips that lack the second 512-bit FMA.
Memory bandwidth as a whole has been a problem that has gone completely out of control. Since 2015, computational power has increased by more than 5x while memory bandwidth has barely improved by 50%. Needless to say, this trend is completely unsustainable at least for this field of high performance computing.
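The size of that gap can be made concrete with a small calculation. This sketch takes the rounded figures above at face value (5x compute, 1.5x bandwidth since 2015). Since the text says "more than 5x" and "barely 50%", the real imbalance is likely somewhat larger. The variable names are illustrative.

```python
# Rounded growth factors since 2015, as quoted in the text.
compute_growth = 5.0      # "more than 5x" computational power
bandwidth_growth = 1.5    # "barely improved by 50%" memory bandwidth

# How much more work each byte of memory traffic must now support
# for an algorithm to stay compute-bound rather than memory-bound.
compute_per_byte_growth = compute_growth / bandwidth_growth

# Equivalently: the share of 2015-era bandwidth per unit of compute that remains.
bandwidth_per_compute = bandwidth_growth / compute_growth

print(f"compute per byte of bandwidth: {compute_per_byte_growth:.1f}x higher")
print(f"bandwidth per unit of compute: {bandwidth_per_compute:.0%} of the 2015 level")
```

Put differently, an algorithm that was balanced between compute and memory traffic in 2015 would need to do roughly three times as much work per byte moved to stay balanced today.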
Stress Testing:
Testing and validation of v0.8.1 was done on 8 computers which were long believed to be stable (most aren't even overclocked). All 8 of these machines held against older versions of y-cruncher during past releases. But for this release, 2 of them were found to be unstable. Neither was overclocked and both were running completely within spec.
Neither machine could be fixed by downclocking or overvolting. One of them (an Intel laptop) had to be retired. The other (a custom-built AMD desktop) was eventually stabilized by changing the motherboard. (Yes, this was a huge headache and a massive distraction from the software development.)
What does this mean for stress-testing? While it's tempting to conclude that v0.8.1 is more stressful than older versions, this sample size of 8 really isn't enough. So I'll leave it to the rest of the overclocking community to decide. The specific stress-test you want to run is called "VT3" which is the newly rewritten version of the "VST" test that everyone seems to love. Likewise, any large in-memory computation will be running the new code.
The Hybrid NTT Algorithm:
As promised in the previous announcement, y-cruncher's good old Hybrid NTT algorithm has now been published here. Despite its importance to y-cruncher's early days, it is not as conceptually spectacular as one would assume by modern (adult) standards. But as a kid when I first wrote it, it was amazing.
Anyways, I hope everyone enjoys this new version. As mentioned, this is just part one of the ongoing rewrite of the internal algorithms. While there's still a lot of work to do (including optimizations), development will now shift to swap mode. So in the short term, I don't expect any more performance swings beyond compiler changes and new optimizations for new processors.
Upcoming Changes for v0.8.x: (June 7, 2023)
In an effort to clean up and modernize the project, most of the large multiply algorithms are getting either refreshed or removed. Algorithms that are useful on modern processors are getting redesigned and rewritten from scratch while the rest will be completely removed from the codebase.
The implication of this will be performance gains on newer processors and regressions on older processors.
If this sounds big, it is. More than 400,000 lines of code will be touched. Work actually began more than 3 years ago, but very little progress was made until this year, when I've been on garden leave and therefore not working.
As of today, enough has been done to get some preliminary in-memory benchmarks:
Processor | Architecture | Year | Clock Speeds | Binary | ISA | Pi computation Speedup vs. v0.7.10 |
Core i7 920 | Intel Nehalem | 2008 | 3.5 GHz + 3 x 1333 MT/s | 08-NHM ~ Ushio | x64 SSE4.1 | -27% |
Core i7 3630QM | Intel Ivy Bridge | 2012 | stock + 2 x 1600 MT/s | 11-SNB ~ Hina | x64 AVX | -10% |
FX-8350 | AMD Piledriver | 2012 | stock + 2 x 1600 MT/s | 11-BD1 ~ Miyu | x64 FMA4 | -1% |
Core i7 5960X | Intel Haswell | 2013 | 4.0 GHz + 4 x 2400 MT/s | 13-HSW ~ Airi | x64 AVX2 | 3 - 4% |
Core i7 6820HK | Intel Skylake | 2015 | stock + 2 x 2133 MT/s | 14-BDW ~ Kurumi | x64 AVX2 + ADX | 4 - 7% |
Ryzen 7 1800X | AMD Zen 1 | 2017 | stock + 2 x 2866 MT/s | 17-ZN1 ~ Yukina | x64 AVX2 + ADX | ~1% |
Core i9 7900X | Intel Skylake X | 2017 | 3.6 GHz (AVX512) + 4 x 3000 MT/s | 17-SKX ~ Kotori | x64 AVX512-DQ | 6 - 9% |
Core i9 7940X | Intel Skylake X | 2017 | 3.6 GHz (AVX512) + 4 x 3466 MT/s | 17-SKX ~ Kotori | x64 AVX512-DQ | 10 - 13% |
Ryzen 9 3950X | AMD Zen 2 | 2019 | stock + 2 x 3000 MT/s | 19-ZN2 ~ Kagari | x64 AVX2 + ADX | 13 - 14% |
Core i3 8121U | Intel Cannon Lake | 2018 | stock + 2 x 2400 MT/s | 18-CNL ~ Shinoa | x64 AVX512-VBMI | 16 - 17% |
Core i7 1165G7 | Intel Tiger Lake | 2020 | stock + 2 x 2666 MT/s | 18-CNL ~ Shinoa | x64 AVX512-VBMI | 12 - 22% |
Core i7 11800H | Intel Tiger Lake | 2020 | stock + 2 x 3200 MT/s | 18-CNL ~ Shinoa | x64 AVX512-VBMI | 23 - 27% |
Ryzen 9 7950X | AMD Zen 4 | 2022 | stock + 2 x 4400 MT/s | 22-ZN4 ~ Kizuna | x64 AVX512-GFNI | 23 - 31% |
The loss of performance for the oldest processors is primarily due to the removal of the Hybrid NTT. Yes, the Hybrid NTT that started the entire y-cruncher project is now gone. While it was the fastest thing in 2008, it unfortunately did not age very well. Stay tuned for a future blog about the algorithm. It will no longer be a secret.
Overall, there is still a lot of work to do. For example, swap-mode is still using the old implementations and will need to be revamped as well. But since the new code has reached or exceeded performance parity for the chips I care about, this is a good stopping point for v0.8.1 pending testing and validation.
Nevertheless, the benchmarks above are not final and are subject to change. Specifically, there are unresolved toolchain issues where Intel is removing their old compiler while its replacement is still significantly worse. And it's unclear whether it can be fixed before it is no longer possible to keep using their old compiler.
A big unknown is how stress-testing will be affected. Despite not being designed for this purpose, y-cruncher's stress-test is notorious for its ability to expose memory instabilities that other (even dedicated) memory testing applications cannot. In other words, it is one of the best memory testers out there. But with so much stuff being rewritten, there's no telling how this will change. Nevertheless, it doesn't make a whole lot of sense to keep around hundreds of thousands of lines of old code even if it turns out to be the better stress test.
So yeah... Out with the old and in with the new. Expect to see Zen 4 gaining up to 20% speedup with AVX512 vs. just AVX2 - no wider execution units needed.
The Need for Speed!: (April 19, 2023)
Jordan Ranous from StorageReview has just flexed a system that matched Google's 100 trillion digit world record in just 59 days. You can read more about it here:
I'm not sure if Google's record used SSDs or hard drives, but if the latter, this would be the first large computation done entirely on SSDs.
It's probably safe to say that since StorageReview is able to match the world record in a fraction of the time, they are more than capable of beating it. So everyone else better watch out!
Intel Optimizations (or lack of): (April 17, 2023)
I've been asked a number of times about why I haven't done any optimizations for recent Intel processors. The latest Intel processor which y-cruncher has optimizations for is Tiger Lake which is 2 generations behind the latest (Raptor Lake). And because Raptor Lake lacks AVX512, it can only run a binary going all the way back to Skylake client (circa 2015).
There are a number of reasons for this:
Removing AVX512 is a huge step back in more ways than just the instruction width. It also removes all the other (non-width) functionality exclusive to AVX512 such as masking, all-to-all permutes, and increased register count. From a developer perspective, this is very discouraging since most of the algorithms I've been working on since 2016 have been heavily influenced by (if not outright designed for) AVX512.
The lack of AVX512 is likely why Tiger Lake and Rocket Lake outperform Alder Lake in single-threaded benchmarks where memory bandwidth and core count are not a factor.
The split of P and E cores is quite frankly a nightmare to optimize for at all levels:
This is not to say it's impossible to optimize for heterogeneous computing, but it is not a direction that I would like to move y-cruncher towards.
Obviously, Intel had their reasons to do this. Client processors are not generally used for HPC, they are used for desktop applications - like gaming. I suspect that Intel went in this direction in an attempt to remain competitive with AMD once it became apparent that the only way they could match AMD in both single-threaded and multi-threaded performance was to build a chip that had P-cores specifically for single-threaded tasks and E-cores for multi-threaded ones.
That said, AMD also seems to be moving in the direction of heterogeneous computing with the 2-CCD Zen 4 3D V-Cache processors. This is also something that cannot be easily optimized for.
What about the server chips?
Server chips for both Intel and AMD remain sane for now. And I hope they stay that way since this is where the majority of HPC lives. And while there's room for optimization here (in particular, Intel's Sapphire Rapids and AMD's Genoa-X with 3D V-Cache), they all remain far beyond my personal budget. So barring a sponsor or a donation, I'm unlikely to target these systems any time soon.
In the end, none of this matters a whole lot because of memory bandwidth. With or without AVX512, and with or without optimized code for each core type, memory bandwidth holds everything back. And this applies to both Intel and AMD. Thus from a developer perspective, it makes zero sense to go to hell and back to deal with heterogeneous computing when it won't matter much anyway. Furthermore, this heterogeneous computing revolution is different from the multi-core revolution of 2 decades ago in that parallel computing brought unbounded performance gains whereas heterogeneous computing can only squeeze out a (small) constant factor of speedup.
So rather than figuring out heterogeneous chips, most of the work has been research on memory bandwidth. y-cruncher's existing algorithms already have their space-time-tradeoff sliders completely maxed out in the direction of reducing memory/bandwidth at the cost of additional computation. So new stuff will be needed.
One of the biggest weaknesses in y-cruncher is its inability to fully utilize modern caches which are large, shared, and deeply hierarchical. So on paper, AMD shows the most potential for future improvement here because that massive 3D V-Cache is very much underutilized.