News (2023)


Back To:


The Need for Speed!: (April 19, 2023) - permalink


Jordan Ranous from StorageReview has just flexed a system that matched Google's 100 trillion digit world record in just 59 days. You can read more about it here:

I'm not sure if Google's record used SSDs or hard drives, but if the latter, this would be the first large computation done entirely on SSDs.


It's probably safe to say that since StorageReview is able to match the world record in a fraction of the time, they are more than capable of beating it. So everyone else better watch out!



Intel Optimizations (or lack of): (April 17, 2023) - permalink


I've been asked a number of times about why I haven't done any optimizations for recent Intel processors. The latest Intel processor which y-cruncher has optimizations for is Tiger Lake which is 2 generations behind the latest (Raptor Lake). And because Raptor Lake lacks AVX512, it can only run a binary going all the way back to Skylake client (circa 2015).


There are a number of reasons for this:

  1. Removal of AVX512 starting from Alder Lake.
  2. Heterogeneous computing (splitting of P and E-cores) is difficult to optimize for.
  3. Memory bandwidth remains the biggest bottleneck.

Removing AVX512 is a huge step back in more ways than just the instruction width. It also removes all the other (non-width) functionality exclusive to AVX512 such as masking, all-to-all permutes, and increased register count. From a developer perspective, this very discouraging since most of the algorithms I've been working on since 2016 have been heavily influenced by (if not outright designed for) AVX512.


The lack of AVX512 is likely why Tiger Lake and Rocket Lake outperform Alder Lake in single-threaded benchmarks where memory bandwidth and core count are not a factor.


The split of P and E cores is quite frankly a nightmare to optimize for at all levels:

This is not to say it's impossible to optimize for heterogeneous computing, but it is not a direction that I would like to move y-cruncher towards.


Obviously, Intel had their reasons to do this. Client processors are not generally used for HPC, they are used for desktop applications - like gaming. I suspect that Intel went in this direction in an attempt to remain competitive with AMD once it became apparent that the only way they could match AMD in both single-threaded and multi-threaded performance was to build a chip that had P-cores specifically for single-threaded tasks and E-cores for multi-threaded ones.


That said, AMD also seems to be moving in the direction of heterogeneous computing with the 2-CCD Zen4 3D V-Cache processors. This is also something that cannot be easily optimized for.


What about the server chips?


Server chips for both Intel and AMD remain sane for now. And I hope they stay that way since this is where the majority of HPC lives. And while there's room here, (in particular: Intel's Sapphire Rapids and AMD's Genoa-X with 3D V-Cache), they all remain far beyond my personal budget. So barring a sponsor or a donation, I'm unlikely to target these systems any time soon.


In the end, none of this matters a whole lot because of memory bandwidth. With or without AVX512, and with or without optimized code for each core type, memory bandwidth holds everything back. And this applies to both Intel and AMD. Thus from a developer perspective, it makes zero sense to go to hell and back to deal with heterogeneous computing when it won't matter much anyway. Furthermore, this heterogeneous computing revolution is different from the multi-core revolution of 2 decades ago in that parallel computing brought unbounded performance gains whereas heterogeneous computing can only squeeze out a (small) constant factor of speedup.


So rather than figuring out heterogeneous chips, most of the work has been research on memory bandwidth. y-cruncher's existing algorithms already have their space-time-tradoff sliders completely maxed out in the direction of reducing memory/bandwith at the cost of additional computation. So new stuff will be needed.


One of the biggest weaknesses in y-cruncher is its inability to fully utilize modern caches which are large, shared, and deeply hierarchical. So on paper, AMD shows the most potential for future improvement here because that massive 3D V-Cache is very much underutilized.