y-cruncher - A Multi-Threaded Pi-Program |
||
From a high-school project that went a little too far...By Alexander J. Yee |
(Last updated: April 3, 2024)
Shortcuts:
The first scalable multi-threaded Pi-benchmark for multi-core systems...
How fast can your computer compute Pi?
y-cruncher is a program that can compute Pi and other constants to trillions of digits.
It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.
y-cruncher has been used to set several world records for the most digits of Pi ever computed.
Current Release:
Windows: Version 0.8.4 Build 9538 (Released: February 22, 2024)
Linux : Version 0.8.4 Build 9538 (Released: February 22, 2024)
Official Mersenneforum Subforum.
Official HWBOT forum thread.
Countering the Compiler Regression with Optimizations: (April 3, 2024) - permalink
These kind of topics are hard to write about since it's not all positive. But let's start with a table because everyone hates walls of text:
Processor | Architecture | Clock Speeds | Binary | ISA | 1 Billion Digits of Pi (times in seconds) | |||||
v0.8.4 | v0.8.5 (ICC) | v0.8.5 (ICX) | v0.8.4 -> v0.8.5 | ICC -> ICX | Overall | |||||
Core i7 920 | Intel Nehalem | 3.5 GHz + 3 x 1333 MT/s | 08-NHM | x64 SSE4.1 | 535.818 | 492.971 | 482.982 | +8.00% | +2.03% | +9.86% |
Core i7 3630QM | Intel Ivy Bridge | stock + 2 x 1600 MT/s | 11-SNB | x64 AVX | 339.96 | 318.037 | 305.360 | +6.45% | +3.99% | +10.18% |
FX-8350 | AMD Piledriver | stock + 2 x 1600 MT/s | 12-BD2 | x64 FMA3 | 225.749 | 218.338 | 216.159 | +3.28% | +1.00% | +4.25% |
Core i7 5960X | Intel Haswell | 4.0 GHz + 4 x 2400 MT/s | 13-HSW | x64 AVX2 | 49.441 | 48.568 | 50.205 | +1.77% | -3.37% | -1.55% |
Core i7 6820HK | Intel Skylake | stock + 2 x 2133 MT/s | 14-BDW | x64 AVX2 + ADX | 102.144 | 100.887 | 103.570 | +1.23% | -2.66% | -1.40% |
Ryzen 7 1800X | AMD Zen 1 | stock + 2 x 2866 MT/s | 17-ZN1 | x64 AVX2 + ADX | 77.505 | 75.965 | 76.800 | +1.99% | -1.10% | +0.91% |
Core i9 7940X | Intel Skylake X | 3.6 GHz (AVX512) + 4 x 3466 MT/s | 17-SKX | x64 AVX512-DQ | 20.686 | 19.912 | 20.428 | +3.74% | -2.59% | +1.25% |
Ryzen 9 3950X | AMD Zen 2 | stock + 2 x 2666 MT/s | 19-ZN2 | x64 AVX2 + ADX | 34.814 | 33.292 | 33.161 | +4.37% | +0.39% | +4.75% |
Core i7 11800H | Intel Tiger Lake | stock + 2 x 3200 MT/s | 18-CNL | x64 AVX512-VBMI | 35.739 | 34.438 | 35.052 | +3.64% | -1.78% | +1.92% |
Ryzen 9 7950X | AMD Zen 4 | stock + 2 x 5000 MT/s | 22-ZN4 | x64 AVX512-GFNI | 18.978 | 18.848 | 18.937 | +0.69% | -0.47% | +0.22% |
ICC is Intel's old Classic C++ Compiler (ICC). And ICX is Intel's new LLVM Compiler (ICX). So we can see that:
In short, Intel's new compiler is causing a performance regression in y-cruncher. In order to prevent the next version of y-cruncher from actually getting slower, I am trying to offset the regressions with new performance optimizations - with only partial success so far.
But how did we get here?
Intel's classic C++ compiler has historically been the best compiler for code performance. However, starting from about 2020, Intel began migrating to a new LLVM-based compiler (ICX) which they wrapped up last year by discontinuing their old compiler (ICC). The problem is that for y-cruncher at least, ICX isn't actually better than ICC.
Processor | Architecture | Clock Speeds | Binary | ISA | BBP - 10 billionth Hex Digit of Pi (times in seconds) | |||
MSVC 17.7.1 | ICC 19.2 | ICX 2024 | ICC -> ICX | |||||
Core i7 920 | Intel Nehalem | 3.5 GHz | 08-NHM | x64 SSE4.1 | 568.384 | 574.745 | 725.910 | -26.30% |
Core i7 3630QM | Intel Ivy Bridge | stock | 11-SNB | x64 AVX | 525.811 | 436.337 | 464.628 | -6.48% |
FX-8350 | AMD Piledriver | stock | 12-BD2 | x64 FMA3 | 251.695 | 231.205 | 235.828 | -2.00% |
Core i7 5960X | Intel Haswell | 4.0 GHz | 13-HSW | x64 AVX2 | 55.249 | 50.640 | 53.422 | -5.49% |
Core i7 6820HK | Intel Skylake | stock | 14-BDW | x64 AVX2 | 107.977 | 105.307 | 108.959 | -3.47% |
Ryzen 7 1800X | AMD Zen 1 | stock | 17-ZN1 | x64 AVX2 | 97.809 | 97.915 | 95.269 | +2.70% |
Core i9 7940X | Intel Skylake X | 3.6 GHz (AVX512) | 17-SKX | x64 AVX512-DQ | 13.518 | 13.561 | 15.340 | -13.12% |
Ryzen 9 3950X | AMD Zen 2 | stock | 19-ZN2 | x64 AVX2 | 22.506 | 21.043 | 20.982 | +0.29% |
Core i7 11800H | Intel Tiger Lake | stock | 18-CNL | x64 AVX512-DQ | 50.002 | 50.798 | 51.654 | -1.69% |
Ryzen 9 7950X | AMD Zen 4 | stock | 22-ZN4 | x64 AVX512-DQ | 11.521 | 11.424 | 12.232 | -7.07% |
In other words, Intel got rid of their old compiler while their new compiler has yet to match it in performance. And because of the need to stay up-to-date with C++ features and CPU instruction sets, I cannot stay on an old compiler forever. Thus an "upgrade" is inevitable even if that hurts performance.
What about other compilers? If Intel's new compiler is bad, what about other alternatives? Well...
So even though Intel has made their compiler worse, it's still better than its competitors.
So why is Intel's new compiler worse than their old compiler?
There is no single regression in Intel's LLVM compiler that is responsible for the entire regression vs. their classic compiler. It's a combination of many regressions (and improvements) that collectively add up with the regressions winning in the end by several %. Anecdotally speaking, small regressions tended to involve inferior instruction selection and ordering while larger regressions tended to fall into these categories:
This last category is particularly nasty. Complicated loop optimizations like loop-interchange, loop fusion/fission, loop materialization, aggressive loop unrolling, etc... are only turned on at maximum optimization level (O3) for most compilers due to their high risk of backfiring. However, I've observed that most of these are already enabled at O1 and O2 for ICX and are difficult or impossible to disable. And when such optimizations backfire, it can kill performance of the loop easily by a factor of 3x or more.
Below are some pseudo-code examples illustrating major ways that I have observed ICX loop optimizations to backfire. Actual code that experiences such behavior are generally much larger and more complicated in size. Self-contained samples have been provided to Intel's engineers in the hope that they can improve their compiler.
Example 1: Loop Fusion Gone Bad
double* A = ...;
for (size_t c = 0; c < length; c++){
double tmp = A[c];
// Long dependency chain.
A[c] = tmp;
}
for (size_t c = 0; c < length; c++){
double tmp = A[c];
// Long dependency chain.
A[c] = tmp;
}
In this example, the iterations of each loop are all independent and can be run in parallel. But within each iteration is a long dependency chain. In order to keep the iterations within the CPU reorder window, the work is intentionally split into multiple loops (more than just 2 loops as shown here). This allows the CPU to reorder across iterations - thus allowing instruction level parallelism.
However, ICX doesn't always allow this to happen. Instead, it sometimes decides to undo my hand optimization by fusing the loops back together into this:
for (size_t c = 0; c < length; c++){
double tmp = A[c];
// Super long dependency chain.
A[c] = tmp;
}
While this improves memory locality by traversing the array only once instead of twice. It has increased the dependency chain to the point that the CPU is no longer able to sufficiently reorder across iterations. Thus it kills instruction-level parallelism (ILP) and hurts performance. The compiler may be incorrectly assuming that the dataset does not fit in cache when in fact it does.
The same situation can happen with loop-interchange where ICX will interchange loops to improve memory locality at the cost of creating dependency chains that wipe out instruction level parallelism.
Example 2: Everything Blows Up
This example is a pathologically bad case where Loop-invariant Code Motion (LICM) and loop unrolling combine to create a perfect storm that simultaneously blows up instruction cache, data cache, and performance. While it looks rather specific, it is nevertheless a common pattern in y-cruncher.
Here the code is iterating an array of AVX512 vectors that uses 1000 scalar weights. Each time a scalar weight is used, it is broadcast to a full vector to operate on the array A. In AVX512, a scalar broadcast has the same cost as a full vector load. So there is no added cost of redoing the broadcast in the inner loop.
const double* weights = ...;
__m512d* A = ...;
for (size_t c = 0; c < length; c++){
__m512d tmp = A[c];
for (size_t w = 0; w < 1000; w++){
__m512d weight =
_mm512_set1_pd(weights[w]);
// Do something with "tmp" and "weight".
}
A[c] = tmp;
}
Instead, ICX has a tendency to turn it into the following:
const double* weights = ...;
__m512d* A = ...;
__m512d expanded_weights0 = _mm512_set1_pd(weights[0]);
__m512d expanded_weights1 = _mm512_set1_pd(weights[1]);
__m512d expanded_weights2 = _mm512_set1_pd(weights[2]);
...
__m512d expanded_weights999 = _mm512_set1_pd(weights[999]);
for (size_t c = 0; c < length; c++){
__m512d tmp = A[c];
// Do something with "tmp" and "expanded_weights0".
// Do something with "tmp" and "expanded_weights1".
// Do something with "tmp" and "expanded_weights2".
// ...
// Do something with "tmp" and "expanded_weights999".
A[c] = tmp;
}
What was supposed to be a bunch of (free) scalar broadcasts has turned into 64 KB of stack usage and two fully unrolled 1000-iteration loops - one of which is completely useless. In this example, this transformation is never beneficial as broadcast loads are already free to begin with. So replacing them with stack spills and trashing both the data and instruction caches only makes things worse. For small values of length, this transformation is devastating to performance due to the initial setup.
So what happened?
To put it simply, other compilers do not do this kind of stuff. Or at least they have limits to prevent this from happening. ICX appears to be completely unrestrained.
A common theme among ICX misoptimizations is that Loop Invariant Code Motion (LICM) and Common Subexpression Elimination (CSE) will create additional live values that end up getting spilled to the stack, thus invoking a penalty that is often larger than the initial savings. The example above is a cherry-picked example where ICX takes this concept to the extreme resulting in an avalanche of secondary regressions such as misalignment penalties and cache pollution.
Conclusion:
Intel's LLVM compiler is undoubtly a very powerful compiler. And the more I study it, the more I am impressed with its ability. However, with power comes responsibility, and unfortunately I cannot say that ICX wields this power well. I have yet to investigate if these issues are in LLVM itself or in Intel's modifications to it. But regardless, as of today, Intel's LLVM compiler can be best described as a child running with scissors - young and reckless with dangerous tools.
How long will it take for ICX to reach ICC's qualty of code generation? I have no idea. And after waiting more than a year for this to happen, I've decided that it's probably not going to happen for a very long time. For every thing that ICX screws up, it probably gets 5 others right. But for code that has already been hand-optimized, getting it right is neutral while getting it wrong hurts a lot. Dropping down to assembly is not an option because there are "thousands" of distinct kernels which are largely generated via template metaprogramming.
Is y-cruncher the only application affected like this? Probably not.
y-cruncher has been used to set a number of world record sized computations.
Blue: Current World Record
Green: Former World Record
Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.
Date Announced | Date Completed: | Source: | Who: | Constant: | Decimal Digits: | Time: | Computer: |
March 14, 2024 | February 27, 2024 | Source | Jordan Ranous Kevin O’Brien Brian Beeler (StorageReview) |
Pi | 105,000,000,000,000 | 2 x AMD Epyc 9754 |
|
February 13, 2024 | February 12, 2024 | Jordan Ranous | Log(2) | 3,000,000,000,000 | 2 x Intel Xeon Platinum 8460H 512 GB |
||
January 17, 2024 | January 10, 2023 | Mamdouh Barakat | Zeta(5) | 250,000,000,000 |
Not Verified |
Intel Xeon Gold 5412U 125 GB |
|
January 17, 2024 | December 12, 2023 | Jordan Ranous | Gamma(1/4) | 1,000,000,000,000 | 2 x Intel Xeon Platinum 8450H 512 GB |
||
December 26, 2023 | December 24, 2023 | Jordan Ranous | e | 35,000,000,000,000 | 2 x Intel Xeon Platinum 8460H |
||
December 26, 2023 | December 25, 2023 | Jordan Ranous | Square Root of 2 | 20,000,000,000,000 | Intel Xeon Platinum 8450H 512 GB Intel Xeon Platinum 8460H 512 GB |
||
December 26, 2023 | December 22, 2023 | Andrew Sun |
Zeta(3) - Apery's Constant | 2,020,569,031,595 | Compute: 5.61 days | Intel Xeon Platinum 8347C 505 GB Intel Xeon Platinum 8347C 507 GB |
|
December 18, 2023 | December 15, 2023 | Jordan Ranous | Gamma(1/3) | 1,000,000,000,000 | 2 x Intel Xeon Platinum 8450H 512 GB |
||
December 18, 2023 | December 11, 2023 | Jordan Ranous | Zeta(5) | 201,000,001,300 | 2 x AMD EPYC 9754 1.5 TB |
||
December 2, 2023 | November 27, 2023 | Jordan Ranous | Golden Ratio | 20,000,000,000,000 | AMD Epyc 9654 - 1.5 TB Intel Xeon Platinum 8450H |
||
September 9, 2023 | September 7, 2023 | Andrew Sun |
Euler-Mascheroni Constant | 1,337,000,000,000 | Intel Xeon Platinum 83470C 400 GB |
||
July 17, 2022 | July 15, 2022 | Seungmin Kim | Lemniscate | 1,200,000,000,100 | 2 x Intel Xeon Gold 6140 |
||
June 8, 2022 | March 21, 2022 | Emma Haruka Iwao | Pi | 100,000,000,000,000 | 128 vCPU Intel Ice Lake (GCP) |
||
March 14, 2022 | March 9, 2022 | Seungmin Kim | Catalan's Constant | 1,200,000,000,100 | Compute: 48.6 days | 2 x Intel Xeon Gold 6140 |
|
August 17, 2021 | August 14, 2021 | Source | UAS Grisons | Pi | 62,831,853,071,796 | Compute: 108 days Verify: 34.4 hours |
AMD Epyc 7542 1 TB 34 + 4 Hard Drives |
September 13, 2020 | September 6, 2020 | Seungmin Kim | Log(10) | 1,200,000,000,100 | 2 x Intel Xeon E5-2699 v3 756 GB 2 x Intel Xeon Gold 5220 754 GB |
||
January 29, 2020 | January 29, 2020 | Blog | Timothy Mullican | Pi | 50,000,000,000,000 | 4 x Intel Xeon E7-4880 v2 315 GB 48 Hard Drives |
|
March 14, 2019 | January 21, 2019 | Blogs |
Emma Haruka Iwao | Pi | 31,415,926,535,897 | Compute: 121 days | 2 x Undisclosed Intel Xeon > 1.40 TB DDR4 > 240 TB SSD |
November 15, 2016 | November 11, 2016 | Blog Sponsor |
Peter Trueb | Pi | 22,459,157,718,361 | Compute: 105 days | 4 x Xeon E7-8890 v3 1.25 TB DDR4 20 x 6 TB 7200 RPM Seagate |
October 8, 2014 | October 7, 2014 | Sandon Van Ness (houkouonchi) |
Pi | 13,300,000,000,000 | 2 x Xeon E5-4650L 192 GB DDR3 @ 1333 MHz 24 x 4 TB + 30 x 3 TB |
||
December 28, 2013 | December 28, 2013 | Source | Shigeru Kondo | Pi | 12,100,000,000,050 | 2 x Xeon E5-2690 128 GB DDR3 @ 1600 MHz 24 x 3 TB |
See the complete list including other notably large computations. If you want to set a record yourself, the rules are in that link.
The main computational features of y-cruncher are:
Latest Releases: (February 22, 2024)
Downloading any of these files constitutes as acceptance of the license agreement.
OS Download Link Size Windows
35.0 MB Linux (Static)
26.7 MB Linux (Dynamic)
19.0 MB
Downloads can also be found on GitHub. Use this if you prefer HTTPS.
The Linux version comes in both statically and dynamically linked versions. The static version should work on most Linux distributions, but lacks TBB and NUMA binding. The dynamic version supports all features, but is less portable due to the DLL dependency hell.
HWBOT submission is back with this release. So I expect the leaderboards to be rewritten soon.
System Requirements:
Windows:
- Windows 7 or later.
- The HWBOT submitter requires the Java 8 Runtime.
Linux:
- 64-bit Linux is required. There is no support for 32-bit.
- The dynamic version has been tested on Ubuntu 22.04.
All Systems:
- An x86 or x64 processor.
Very old systems that don't meet these requirements may be able to run older versions of y-cruncher. Support goes all the way back to even before Windows XP.
Version History:
Other Downloads (for C++ programmers):
Advanced Documentation:
Comparison Chart: (Last updated: July 11, 2023)
Computations of Pi to various sizes. All times in seconds. All computations done entirely in ram.
The timings include the time needed to convert the digits to decimal representation, but not the time needed to write out the digits to disk.
Blue: Benchmarks are up-to-date with the latest version of y-cruncher.
Green: Benchmarks were done with an old version of y-cruncher that is comparable in performance with the current release.
Red: Benchmarks are significantly out-of-date due to being run with an old version of y-cruncher that is no longer comparable with the current release.
Purple: Benchmarks are from unreleased internal builds that are not speed comparable with the current release.
Laptops + Low-Power:
Processor(s): | Core i7 6820HK | Core i7 11800H | Core i7 11800H |
Generation: | Intel Skylake | Intel Tiger Lake | Intel Tiger Lake |
Cores/Threads: | 4/8 | 8/16 | 8/16 |
Processor Speed: | 3.2 GHz (stock) | ~2.5 GHz (45W PL) | ~3.0 GHz (60W PL) |
Memory: | 64 GB @ 2133 MT/s | 64 GB @ 3200 MT/s | 64 GB @ 3200 MT/s |
Version: | v0.8.1 (14-BDW) | v0.8.1 (18-CNL) | v0.8.1 (18-CNL) |
Instruction Set: | x64 AVX2 + ADX | x64 AVX512-VBMI | x64 AVX512-VBMI |
25,000,000 | 1.500 | 0.655 | 0.530 |
50,000,000 | 3.307 | 1.406 | 1.125 |
100,000,000 | 7.238 | 3.005 | 2.447 |
250,000,000 | 20.596 | 8.576 | 6.855 |
500,000,000 | 45.967 | 19.747 | 15.356 |
1,000,000,000 | 102.885 | 42.727 | 34.308 |
2,500,000,000 | 290.824 | 123.523 | 96.918 |
5,000,000,000 | 640.506 | 247.705 | 218.782 |
10,000,000,000 | 1,391.204 | 526.212 | 480.197 |
Credit: |
Processor(s): | Core i3 8121U | Core i7 11800H | ||||
Generation: | Intel Cannon Lake | Intel Tiger Lake | ||||
Cores/Threads: | 2/4 | 8/16 | ||||
Processor Speed: | ~2.5 - 3.2 GHz (stock) | ~2.5 - 2.8 GHz (45W PL) | ||||
Memory: | 8 GB @ 2400 MT/s | 64 GB @ 3200 MT/s | ||||
Version: | v0.8.1 (14-BDW) | v0.8.1 (17-SKX) | v0.8.1 (18-CNL) | v0.8.1 (14-BDW) | v0.8.1 (17-SKX) | v0.8.1 (18-CNL) |
Instruction Set: | x64 AVX2 + ADX | x64 AVX512-DQ | x64 AVX512-VBMI | x64 AVX2 + ADX | x64 AVX512-DQ | x64 AVX512-VBMI |
25,000,000 | 2.857 | 2.467 | 1.988 | 0.907 | 0.853 | 0.655 |
50,000,000 | 6.446 | 5.501 | 4.392 | 2.075 | 1.862 | 1.406 |
100,000,000 | 14.335 | 12.257 | 9.490 | 4.176 | 3.749 | 3.005 |
250,000,000 | 42.566 | 36.204 | 27.137 | 12.014 | 10.705 | 8.576 |
500,000,000 | 99.040 | 85.443 | 64.359 | 28.805 | 24.123 | 19.747 |
1,000,000,000 | 228.863 | 198.405 | 151.605 | 63.898 | 55.264 | 42.727 |
2,500,000,000 | 187.882 | 148.423 | 123.523 | |||
5,000,000,000 | 375.130 | 327.776 | 247.705 | |||
10,000,000,000 | 794.573 | 709.606 | 526.212 | |||
Credit: |
Mainstream Desktops:
Processor(s): | Ryzen 5 7600 | Core i9 11700K | Ryzen 9 3950X | Ryzen 9 5950X | Intel Core i9 13900KS | Ryzen 9 7950X | |
Generation: | AMD Zen 4 | Intel Rocket Lake | AMD Zen 2 | AMD Zen 3 | Intel Raptor Lake | AMD Zen 4 | |
Cores/Threads: | 6/12 | 8/16 | 16/32 | 16/32 | 24/32 | 16/32 | |
Processor Speed: | stock | stock | stock | 5.7/4.5 GHz | stock | ||
Memory: | 32 GB | 32 GB - 3200 MT/s | 128 GB - 2666 MT/s | 64 GB - 3200 MT/s | 96 GB - 8000 MT/s | 128 GB - 4400 MT/s | 128 GB - 5200 MT/s |
Program Version: | v0.8.1 (22-ZN4) | v0.8.1 (18-CNL) | v0.8.1 (19-ZN2) | v0.8.1 (19-ZN2) | v0.8.1 (14-BDW) | v0.8.1 (22-ZN4) | |
Instruction Set: | x64 AVX512-GFNI | x64 AVX512-VBMI | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX512-GFNI | |
25,000,000 | 0.439 | 0.501 | 0.588 | 0.490 | 0.241 | 0.312 | 0.307 |
50,000,000 | 1.114 | 1.257 | 1.090 | 0.525 | 0.679 | 0.654 | |
100,000,000 | 2.223 | 2.685 | 2.345 | 1.132 | 1.517 | 1.410 | |
250,000,000 | 6.220 | 7.251 | 6.371 | 3.185 | 4.157 | 3.820 | |
500,000,000 | 13.378 | 13.573 | 15.556 | 13.395 | 7.065 | 8.883 | 8.062 |
1,000,000,000 | 29.497 | 30.415 | 33.925 | 29.301 | 15.901 | 18.542 | 17.039 |
2,500,000,000 | 83.421 | 86.119 | 96.695 | 82.204 | 44.888 | 50.743 | 46.467 |
5,000,000,000 | 181.647 | 193.718 | 215.333 | 181.355 | 99.566 | 110.379 | 101.345 |
10,000,000,000 | 473.958 | 399.012 | 241.162 | 220.522 | |||
25,000,000,000 | 1,361.732 | 680.344 | 623.493 | ||||
Credit: | Joel Rufin | Oliver Kruse |
|
Oliver Kruse | 曾 铮 |
Processor(s): | Core i7 920 | FX-8350 | Core i7 4770K | Ryzen 7 1800X | Ryzen 7 3800X |
Generation: | Intel Nehalem | AMD Piledriver | Intel Haswell | AMD Zen 1 | AMD Zen 2 |
Cores/Threads: | 4/8 | 8/8 | 4/8 | 8/16 | 8/16 |
Processor Speed: | 3.5 GHz | stock | 4.0 GHz | stock | stock |
Memory: | 12 GB - 1333 MT/s | 32 GB - 1600 MT/s | 32 GB - 2133 MT/s | 64 GB - 2866 MT/s | 32 GB - 3600 MT/s |
Program Version: | v0.8.1 (08-NHM) | v0.8.1 (11-BD1) | v0.8.1 (13-HSW) | v0.8.1 (17-ZN1) | v0.8.1 (19-ZN2) |
Instruction Set: | x64 SSE4.1 | x64 FMA4 | x64 AVX2 | x64 AVX2 + ADX | x64 AVX2 + ADX |
25,000,000 | 7.032 | 3.677 | 1.546 | 1.150 | 0.654 |
50,000,000 | 17.174 | 7.703 | 3.259 | 2.527 | 1.415 |
100,000,000 | 36.164 | 16.576 | 6.987 | 5.555 | 3.028 |
250,000,000 | 105.789 | 46.597 | 19.588 | 15.760 | 8.404 |
500,000,000 | 236.096 | 103.165 | 43.197 | 34.659 | 18.440 |
1,000,000,000 | 531.676 | 230.780 | 96.845 | 78.690 | 41.097 |
2,500,000,000 | 669.594 | 274.336 | 220.278 | 117.788 | |
5,000,000,000 | 1,460.714 | 606.605 | 493.388 | 266.719 | |
10,000,000,000 | 1,078.187 | ||||
25,000,000,000 | |||||
Credit: | Oliver Kruse |
High-End Desktops:
Processor(s): | Core i7 5960X | Threadripper 1950X | Core i9 7900X | Core i9 7940X | Threadripper 3990X | Xeon W7-2495X | Xeon W9-3475X |
Generation: | Intel Haswell | AMD Zen 1 | Intel Skylake X | Intel Skylake X | AMD Zen 2 | Intel Sapphire Rapids | Intel Sapphire Rapids |
Cores/Threads: | 8/16 | 16/32 | 10/20 | 14/28 | 64/128 | 24/48 | 36/72 |
Processor Speed: | 4.0 GHz | stock | ~3.6 GHz (200W PL) | 3.6 GHz (AVX512) | 2.9 GHz | 4.1-4.9 GHz | 4.2-4.9 GHz |
Memory: | 64 GB - 2400 MT/s | 64 GB - 2800 MT/s | 128 GB - 3000 MT/s | 128 GB - 3466 MT/s | ~141 GB - 2666 MT/s | 64 GB - 6400 MT/s | 128 GB - 6400 MT/s |
Program Version: | v0.8.1 (13-HSW) | v0.8.1 (17-ZN1) | v0.8.1 (17-SKX) | v0.8.1 (17-SKX) | v0.8.1 (19-ZN2) | v0.8.1 (18-CNL) | v0.8.3 (18-CNL) |
Instruction Set: | x64 AVX2 | x64 AVX2 + ADX | x64 AVX512-DQ | x64 AVX512-DQ | x64 AVX2 + ADX | x64 AVX512-VBMI | x64 AVX512-VBMI |
25,000,000 | 0.807 | 0.756 | 0.522 | 0.404 | 0.584 | 0.170 | 0.201 |
50,000,000 | 1.743 | 1.579 | 1.028 | 0.721 | 1.181 | 0.340 | 0.321 |
100,000,000 | 3.647 | 3.273 | 2.048 | 1.451 | 2.409 | 0.726 | 0.586 |
250,000,000 | 10.088 | 8.990 | 5.752 | 4.056 | 5.724 | 2.068 | 1.413 |
500,000,000 | 22.075 | 19.604 | 12.830 | 9.017 | 10.881 | 4.588 | 2.627 |
1,000,000,000 | 49.232 | 43.014 | 28.906 | 20.518 | 21.496 | 10.190 | 5.924 |
2,500,000,000 | 139.404 | 121.645 | 82.764 | 60.636 | 58.009 | 28.881 | 16.345 |
5,000,000,000 | 311.388 | 271.983 | 186.233 | 137.906 | 126.513 | 64.158 | 36.139 |
10,000,000,000 | 669.736 | 613.450 | 401.820 | 302.121 | 274.050 | 124.826 | 78.816 |
25,000,000,000 | 1,125.775 | 843.498 | 768.212 | 225.482 | |||
Credit: | Oliver Kruse | Paul Underwood | 曾 铮 |
Multi-Processor Workstation/Servers:
Due to high core count and the effect of NUMA (Non-Uniform Memory Access), performance on multi-processor systems are extremely sensitive to various settings. Therefore, these benchmarks may not be entirely representative of what the hardware is capable of.
Processor(s): | Xeon Platinum 8375C (AWS x2iedn.32xlarge) |
Xeon Platinum 8488C (AWS m7i.48xlarge) |
Epyc 9R14
(AWS m7a.48xlarge) |
Epyc 9R14
(AWS hpc7a.96xlarge) |
Epyc 9754 | |
Generation: | Intel Sapphire Rapids | Intel Sapphire Rapids | AMD Genoa | AMD Bergamo | ||
Cores/Threads: | 64/128 | 96/192 | 192/192 | 128/256 | 128/128 | |
Processor Speed: | 2.9 GHz | 2.4 GHz | 2.6 GHz | 2.25 - 3.1 GHz | ||
Memory: | 4 TB | 744 GB | 740 GB | 768 GB - 4800 MT/s | ||
Program Version: | v0.8.1 (18-CNL) | v0.8.1 (18-CNL) | v0.8.1 (22-ZN4) | v0.8.1 (22-ZN4) | ||
Instruction Set: | x64 AVX512-VBMI | x64 AVX512-VBMI | x64 AVX512-GFNI | x64 AVX512-GFNI | ||
25,000,000 | 0.250 | 0.163 | 0.216 | 0.213 | 0.245 | 0.229 |
50,000,000 | 0.454 | 0.289 | 0.285 | 0.279 | 0.350 | 0.433 |
100,000,000 | 0.844 | 0.531 | 0.642 | 0.635 | 0.853 | 0.876 |
250,000,000 | 1.976 | 1.288 | 1.776 | 1.716 | 2.224 | 2.133 |
500,000,000 | 3.794 | 2.499 | 3.728 | 3.621 | 4.186 | 3.850 |
1,000,000,000 | 7.650 | 5.149 | 6.547 | 6.265 | 7.063 | 6.495 |
2,500,000,000 | 20.425 | 13.633 | 13.554 | 12.500 | 15.338 | 14.477 |
5,000,000,000 | 45.675 | 29.655 | 25.334 | 22.377 | 29.072 | 28.133 |
10,000,000,000 | 101.468 | 64.026 | 51.134 | 44.059 | 58.797 | 59.007 |
25,000,000,000 | 297.622 | 182.920 | 140.286 | 120.282 | 156.797 | 164.281 |
50,000,000,000 | 678.016 | 410.842 | 321.970 | 275.297 | 350.391 | 368.548 |
100,000,000,000 | 1,549.991 | 943.182 | 771.266 | 672.558 | 829.957 | 853.717 |
250,000,000,000 | 4,488.317 | |||||
500,000,000,000 | 9,685.971 | |||||
Credit: | Greg Hogan | Tim Wesley |
Processor(s): | Xeon Platinum 8124M | Xeon Gold 6148 | Xeon Platinum 8175M | Xeon Platinum 8275CL | Epyc 7742 | Epyc 7B12 | Epyc 7742 |
Generation: | Intel Skylake Purley | Intel Skylake Purley | Intel Skylake Purley | Intel Cascade Lake | AMD Rome | AMD Rome | AMD Rome |
Sockets/Cores/Threads: | 2/36/72 | 2/40/40 | 2/48/96 | 2/48/96 | 2/128/256 | 2/112/224 | 2/128/256 |
Processor Speed: | 3.0 GHz | 2.4 GHz | 2.5 GHz | 3.0 GHz | 2.25 GHz | 2.25 GHz | |
Memory: | 137 GB - ?? | 188 GB - ?? | ~756 GB - ?? | 192 GB | ~504 GB | ~882 GB | 2 TB |
Program Version: | v0.7.5 (17-SKX) | v0.7.6 (17-SKX) | v0.7.6 (17-SKX) | v0.7.8 (17-SKX) | v0.7.7 (17-ZN1) | v0.7.8 (19-ZN2) | v0.7.8 (19-ZN2) |
Instruction Set: | x64 AVX512-DQ | x64 AVX512-DQ | x64 AVX512-DQ | x64 AVX512-DQ | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX2 + ADX |
25,000,000 | 0.540 | 0.329 | 0.294 | 0.283 | 0.534 | 0.439 | 0.513 |
50,000,000 | 0.981 | 0.683 | 0.617 | 0.544 | 1.027 | 0.838 | 0.920 |
100,000,000 | 1.905 | 1.456 | 1.305 | 1.169 | 2.298 | 1.796 | 1.887 |
250,000,000 | 5.085 | 3.737 | 3.591 | 3.125 | 5.854 | 4.509 | 4.650 |
500,000,000 | 10.372 | 7.750 | 7.293 | 6.309 | 10.502 | 8.196 | 8.066 |
1,000,000,000 | 21.217 | 16.550 | 15.041 | 13.042 | 17.836 | 14.252 | 13.246 |
2,500,000,000 | 55.701 | 45.693 | 39.329 | 34.028 | 35.485 | 30.592 | 27.011 |
5,000,000,000 | 118.151 | 99.078 | 83.601 | 71.777 | 62.432 | 58.405 | 49.940 |
10,000,000,000 | 247.928 | 212.984 | 176.695 | 153.169 | 115.543 | 116.900 | 98.156 |
25,000,000,000 | 599.653 | 491.988 | 425.442 | 307.995 | 314.907 | 258.081 | |
50,000,000,000 | 1,081.181 | 690.662 | 741.633 | 598.716 | |||
100,000,000,000 | 1715.123 | 1,370.714 | |||||
250,000,000,000 | 3,872.397 | ||||||
Credit: | Jacob Coleman | Oliver Kruse | newalex | Xinyu Miao | Carsten Spille | Greg Hogan | Song Pengei |
Processor(s): | Xeon E5-2683 v3 | Xeon E7-8880 v3 | Xeon E5-2687W v4 | Xeon E5-2686 v4 | Xeon E5-2696 v4 | Epyc 7601 | Xeon Gold 6130F |
Generation: | Intel Haswell | Intel Haswell | Intel Broadwell | Intel Broadwell | Intel Broadwell | AMD Naples | Intel Skylake Purley |
Sockets/Cores/Threads: | 2/28/56 | 4/64/128 | 2/24/48 | 2/36/72 | 2/44/88 | 2/64/128 | 2/32/64 |
Processor Speed: | 2.03 GHz | 2.3 GHz | 3.0 GHz | 2.3 GHz | 2.2 GHz | 2.2 GHz | 2.1 GHz |
Memory: | 128 GB - ??? | 2 TB - ??? | 64 GB | 504 GB - ??? | 768 GB - ??? | 256 GB - ?? | 256 GB - ?? |
Program Version: | v0.6.9 (13-HSW) | v0.7.1 (13-HSW) | v0.7.6 (14-BDW) | v0.7.7 (14-BDW) | v0.7.1 (14-BDW) | v0.7.3 (17-ZN1) | v0.7.3 (17-SKX) |
Instruction Set: | x64 AVX2 | x64 AVX2 | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX512-DQ |
25,000,000 | 0.907 | 1.176 | 0.490 | 0.494 | 0.715 | 2.459 | 1.150 |
50,000,000 | 1.745 | 2.321 | 1.072 | 0.982 | 1.344 | 4.347 | 1.883 |
100,000,000 | 3.317 | 4.217 | 2.303 | 2.193 | 2.673 | 6.996 | 3.341 |
250,000,000 | 8.339 | 8.781 | 6.196 | 6.044 | 6.853 | 14.258 | 7.731 |
500,000,000 | 17.708 | 15.879 | 13.046 | 12.582 | 14.538 | 24.930 | 15.346 |
1,000,000,000 | 37.311 | 32.078 | 27.763 | 26.852 | 31.260 | 47.837 | 31.301 |
2,500,000,000 | 102.131 | 78.251 | 76.202 | 73.596 | 84.271 | 111.139 | 82.871 |
5,000,000,000 | 218.917 | 164.157 | 165.046 | 160.094 | 192.889 | 228.252 | 179.488 |
10,000,000,000 | 471.802 | 346.307 | 356.487 | 346.305 | 417.322 | 482.777 | 387.530 |
25,000,000,000 | 1,511.852 | 957.966 | 1,006.131 | 980.784 | 1,186.881 | 1,184.144 | 1,063.850 |
50,000,000,000 | 2,096.169 | 2,202.558 | 2,156.854 | 2,601.476 | |||
100,000,000,000 | 4,442.742 | 6,037.704 | |||||
250,000,000,000 | 17,428.450 | ||||||
Credit: | Shigeru Kondo | Jacob Coleman | Cameron Giesbrecht | newalex | "yoyo" | Dave Graham |
The full chart of rankings for each size can be found here:
These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@u.northwestern.edu
Note that I usually do not respond to these emails. I simply put them into the charts which I update periodically (typically within 2 weeks).
Decimal Digits of Pi - Times in Seconds Core i9 7940X @ 3.7 GHz AVX512 |
||
Memory Frequency: | 2666 MT/s | 3466 MT/s |
25,000,000 | 0.839 | 0.758 |
50,000,000 | 1.424 | 1.338 |
100,000,000 | 2.701 | 2.425 |
250,000,000 | 6.489 | 5.877 |
500,000,000 | 13.307 | 11.917 |
1,000,000,000 | 27.913 | 24.915 |
2,500,000,000 | 76.837 | 68.322 |
5,000,000,000 | 168.058 | 148.737 |
10,000,000,000 | 365.047 | 322.115 |
25,000,000,000 | 1,037.527 | 916.039 |
High core count Skylake X processors are known to be heavily bottlenecked by memory bandwidth.
Memory Bandwidth:
Because of the memory-intensive nature of computing Pi and other constants, y-cruncher needs a lot of memory bandwidth to perform well. In fact, the program has been noticably memory bound on nearly all high-end desktops since 2012 as well as the majority of multi-socket systems since at least 2006.
Recommendations:
Don't be surprised if y-cruncher exposes instabilities that other applications and stress-tests do not. y-cruncher is unusual in that it simultaneously places a heavy load on both the CPU and the entire memory subsystem.
Parallel Performance:
y-cruncher has a lot of settings for tuning parallel performance. By default, it makes a best effort to analyze the hardware and pick the best settings. But because of the virtually unlimited combinations of processor topologies, it's difficult for y-cruncher to optimally pick the best settings for everything. So sometimes the best performance can only be achieved with manual settings.
*These are advanced settings that cannot be changed if you're using the benchmark option in the console UI. To change them, you will need to either run benchmark mode from the command line or use the custom compute menu.
Load imbalance is a faily common problem in y-cruncher. The usual causes are:
Large Pages:
Large pages used to not matter in the past, but they do now in the post-Spectre/Meltdown world. Mitigations for the Meltdown vulnerability can have a noticeable performance drop for y-cruncher (up to 5% has been observed). It turns out that turning on large pages can mitigate the penalty for this mitigation. (pun intended)
Refer to the memory allocation guide on how to turn on large pages.
Swap Mode:
This is probably one of the most complicated features in y-cruncher.
Everything in this section is in the process of being re-verified and moved to: https://github.com/Mysticial/y-cruncher/issues
Performance Issues:
Pi and other Constants:
Program Usage:
Hardware and Overclocking:
Academia:
Programming:
Other:
Here's some interesting sites dedicated to the computation of Pi and other constants:
Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.
You can also find me on Twitter as @Mysticial.