y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

(Last updated: April 3, 2024)

 


The first scalable multi-threaded Pi-benchmark for multi-core systems...

 

How fast can your computer compute Pi?

 

y-cruncher is a program that can compute Pi and other constants to trillions of digits.

It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.

 

y-cruncher has been used to set several world records for the most digits of Pi ever computed.

 

Current Release:

Windows: Version 0.8.4 Build 9538 (Released: February 22, 2024)

Linux: Version 0.8.4 Build 9538 (Released: February 22, 2024)

 

Official Mersenneforum Subforum.

Official HWBOT forum thread.

 

News:

 

Countering the Compiler Regression with Optimizations: (April 3, 2024) - permalink

 

These kinds of topics are hard to write about since it's not all positive. But let's start with a table, because everyone hates walls of text:

1 Billion Digits of Pi (times in seconds)

Processor | Architecture | Clock Speeds | Binary | ISA | v0.8.4 | v0.8.5 (ICC) | v0.8.5 (ICX) | v0.8.4 -> v0.8.5 | ICC -> ICX | Overall
Core i7 920 | Intel Nehalem | 3.5 GHz + 3 x 1333 MT/s | 08-NHM | x64 SSE4.1 | 535.818 | 492.971 | 482.982 | +8.00% | +2.03% | +9.86%
Core i7 3630QM | Intel Ivy Bridge | stock + 2 x 1600 MT/s | 11-SNB | x64 AVX | 339.96 | 318.037 | 305.360 | +6.45% | +3.99% | +10.18%
FX-8350 | AMD Piledriver | stock + 2 x 1600 MT/s | 12-BD2 | x64 FMA3 | 225.749 | 218.338 | 216.159 | +3.28% | +1.00% | +4.25%
Core i7 5960X | Intel Haswell | 4.0 GHz + 4 x 2400 MT/s | 13-HSW | x64 AVX2 | 49.441 | 48.568 | 50.205 | +1.77% | -3.37% | -1.55%
Core i7 6820HK | Intel Skylake | stock + 2 x 2133 MT/s | 14-BDW | x64 AVX2 + ADX | 102.144 | 100.887 | 103.570 | +1.23% | -2.66% | -1.40%
Ryzen 7 1800X | AMD Zen 1 | stock + 2 x 2866 MT/s | 17-ZN1 | x64 AVX2 + ADX | 77.505 | 75.965 | 76.800 | +1.99% | -1.10% | +0.91%
Core i9 7940X | Intel Skylake X | 3.6 GHz (AVX512) + 4 x 3466 MT/s | 17-SKX | x64 AVX512-DQ | 20.686 | 19.912 | 20.428 | +3.74% | -2.59% | +1.25%
Ryzen 9 3950X | AMD Zen 2 | stock + 2 x 2666 MT/s | 19-ZN2 | x64 AVX2 + ADX | 34.814 | 33.292 | 33.161 | +4.37% | +0.39% | +4.75%
Core i7 11800H | Intel Tiger Lake | stock + 2 x 3200 MT/s | 18-CNL | x64 AVX512-VBMI | 35.739 | 34.438 | 35.052 | +3.64% | -1.78% | +1.92%
Ryzen 9 7950X | AMD Zen 4 | stock + 2 x 5000 MT/s | 22-ZN4 | x64 AVX512-GFNI | 18.978 | 18.848 | 18.937 | +0.69% | -0.47% | +0.22%

ICC is Intel's old Classic C++ Compiler, and ICX is Intel's new LLVM-based compiler. From the table we can see that:

  1. y-cruncher v0.8.5 will have new software optimizations that improve performance on all processors.
  2. Intel's new compiler (ICX) is worse than their old compiler (ICC) on nearly all modern processors.

In short, Intel's new compiler is causing a performance regression in y-cruncher. In order to prevent the next version of y-cruncher from actually getting slower, I am trying to offset the regressions with new performance optimizations - with only partial success so far.

 

 

But how did we get here?

 

Intel's classic C++ compiler has historically been the best compiler for code performance. However, starting from about 2020, Intel began migrating to a new LLVM-based compiler (ICX) which they wrapped up last year by discontinuing their old compiler (ICC). The problem is that for y-cruncher at least, ICX isn't actually better than ICC.

BBP - 10 billionth Hex Digit of Pi (times in seconds)

Processor | Architecture | Clock Speeds | Binary | ISA | MSVC 17.7.1 | ICC 19.2 | ICX 2024 | ICC -> ICX
Core i7 920 | Intel Nehalem | 3.5 GHz | 08-NHM | x64 SSE4.1 | 568.384 | 574.745 | 725.910 | -26.30%
Core i7 3630QM | Intel Ivy Bridge | stock | 11-SNB | x64 AVX | 525.811 | 436.337 | 464.628 | -6.48%
FX-8350 | AMD Piledriver | stock | 12-BD2 | x64 FMA3 | 251.695 | 231.205 | 235.828 | -2.00%
Core i7 5960X | Intel Haswell | 4.0 GHz | 13-HSW | x64 AVX2 | 55.249 | 50.640 | 53.422 | -5.49%
Core i7 6820HK | Intel Skylake | stock | 14-BDW | x64 AVX2 | 107.977 | 105.307 | 108.959 | -3.47%
Ryzen 7 1800X | AMD Zen 1 | stock | 17-ZN1 | x64 AVX2 | 97.809 | 97.915 | 95.269 | +2.70%
Core i9 7940X | Intel Skylake X | 3.6 GHz (AVX512) | 17-SKX | x64 AVX512-DQ | 13.518 | 13.561 | 15.340 | -13.12%
Ryzen 9 3950X | AMD Zen 2 | stock | 19-ZN2 | x64 AVX2 | 22.506 | 21.043 | 20.982 | +0.29%
Core i7 11800H | Intel Tiger Lake | stock | 18-CNL | x64 AVX512-DQ | 50.002 | 50.798 | 51.654 | -1.69%
Ryzen 9 7950X | AMD Zen 4 | stock | 22-ZN4 | x64 AVX512-DQ | 11.521 | 11.424 | 12.232 | -7.07%

In other words, Intel got rid of their old compiler while their new compiler has yet to match it in performance. And because of the need to stay up-to-date with C++ features and CPU instruction sets, I cannot stay on an old compiler forever. Thus an "upgrade" is inevitable even if that hurts performance.

 

What about other compilers? If Intel's new compiler is bad, what about other alternatives? Well...

So even though Intel has made their compiler worse, it's still better than its competitors.

 

 

So why is Intel's new compiler worse than their old compiler?

 

There is no single regression in Intel's LLVM compiler that accounts for the entire gap versus their classic compiler. It's a combination of many regressions (and improvements) that collectively add up, with the regressions winning in the end by several percent. Anecdotally, the small regressions tended to involve inferior instruction selection and ordering, while the larger regressions tended to involve aggressive loop transformations.

This category is particularly nasty. Complicated loop optimizations like loop-interchange, loop fusion/fission, loop materialization, aggressive loop unrolling, etc... are only turned on at the maximum optimization level (O3) in most compilers due to their high risk of backfiring. However, I've observed that most of these are already enabled at O1 and O2 in ICX and are difficult or impossible to disable. And when such optimizations backfire, they can easily kill a loop's performance by a factor of 3x or more.

 

Below are some pseudo-code examples illustrating the major ways I have observed ICX loop optimizations backfire. Actual code that experiences such behavior is generally much larger and more complicated. Self-contained samples have been provided to Intel's engineers in the hope that they can improve their compiler.

 

 

Example 1: Loop Fusion Gone Bad

 

    double* A = ...;
    for (size_t c = 0; c < length; c++){
        double tmp = A[c];

        //  Long dependency chain.

        A[c] = tmp;
    }
    for (size_t c = 0; c < length; c++){
        double tmp = A[c];

        //  Long dependency chain.

        A[c] = tmp;
    }

In this example, the iterations of each loop are all independent and can be run in parallel. But within each iteration is a long dependency chain. In order to keep the iterations within the CPU reorder window, the work is intentionally split into multiple loops (more than just 2 loops as shown here). This allows the CPU to reorder across iterations - thus allowing instruction level parallelism.

 

However, ICX doesn't always allow this to happen. Instead, it sometimes decides to undo my hand optimization by fusing the loops back together into this:

 

    for (size_t c = 0; c < length; c++){
        double tmp = A[c];

        //  Super long dependency chain.

        A[c] = tmp;
    }

While this improves memory locality by traversing the array only once instead of twice, it lengthens the dependency chain to the point that the CPU is no longer able to sufficiently reorder across iterations. This kills instruction-level parallelism (ILP) and hurts performance. The compiler may be incorrectly assuming that the dataset does not fit in cache when in fact it does.

 

The same situation can happen with loop-interchange where ICX will interchange loops to improve memory locality at the cost of creating dependency chains that wipe out instruction level parallelism.
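To make the effect concrete, here is a minimal compilable sketch of the fission pattern (function and variable names are my own, not y-cruncher's): the fused and split versions compute exactly the same result, but the split version keeps each iteration's dependency chain short enough for the CPU to overlap iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A deliberately serial chain of dependent FLOPs on one element.
inline double chain(double x) {
    for (int i = 0; i < 8; i++) {
        x = x * 1.0000001 + 0.5;  // each step depends on the previous one
    }
    return x;
}

// Fused: one pass over the array, but a 16-step dependency chain per element.
void transform_fused(double* A, std::size_t length) {
    for (std::size_t c = 0; c < length; c++) {
        A[c] = chain(chain(A[c]));
    }
}

// Fissioned: two passes of 8 steps each. Iterations are independent, and the
// shorter chain fits within the CPU's reorder window, exposing ILP.
void transform_split(double* A, std::size_t length) {
    for (std::size_t c = 0; c < length; c++) A[c] = chain(A[c]);
    for (std::size_t c = 0; c < length; c++) A[c] = chain(A[c]);
}
```

Whether the split version actually wins depends on chain length, the CPU's reorder window, and whether the compiler re-fuses the loops - which is exactly the behavior described above.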

 

 

Example 2: Everything Blows Up

 

This example is a pathologically bad case where Loop-invariant Code Motion (LICM) and loop unrolling combine to create a perfect storm that simultaneously blows up instruction cache, data cache, and performance. While it looks rather specific, it is nevertheless a common pattern in y-cruncher.

 

Here the code iterates over an array of AVX512 vectors using 1000 scalar weights. Each time a scalar weight is used, it is broadcast to a full vector to operate on the array A. In AVX512, a scalar broadcast costs the same as a full vector load, so there is no added cost to redoing the broadcast in the inner loop.

 

    const double* weights = ...;
    __m512d* A = ...;

    for (size_t c = 0; c < length; c++){
        __m512d tmp = A[c];

        for (size_t w = 0; w < 1000; w++){
            __m512d weight = _mm512_set1_pd(weights[w]); //  Scalar Broadcast. Same cost as regular load.

            //  Do something with "tmp" and "weight".
        }

        A[c] = tmp;
    }

 

Instead, ICX has a tendency to turn it into the following:

 

    const double* weights = ...;
    __m512d* A = ...;

    __m512d expanded_weights0 = _mm512_set1_pd(weights[0]); //  Each of these is 64 bytes!
    __m512d expanded_weights1 = _mm512_set1_pd(weights[1]);
    __m512d expanded_weights2 = _mm512_set1_pd(weights[2]);
    ...
    __m512d expanded_weights999 = _mm512_set1_pd(weights[999]);

    for (size_t c = 0; c < length; c++){
        __m512d tmp = A[c];

        //  Do something with "tmp" and "expanded_weights0".
        //  Do something with "tmp" and "expanded_weights1".
        //  Do something with "tmp" and "expanded_weights2".
        //  ...
        //  Do something with "tmp" and "expanded_weights999".

        A[c] = tmp;
    }

What was supposed to be a bunch of (free) scalar broadcasts has turned into 64 KB of stack usage and two fully unrolled 1000-iteration loops - one of which is completely useless. In this example, this transformation is never beneficial as broadcast loads are already free to begin with. So replacing them with stack spills and trashing both the data and instruction caches only makes things worse. For small values of length, this transformation is devastating to performance due to the initial setup.

 

So what happened?

  1. The compiler first sees that the inner loop has a compile-time trip count. So it decides it can completely unroll it. I have never seen compilers completely unroll loops this large, but ICX apparently does it with several of y-cruncher's kernels.
  2. The compiler deduces that weights does not alias with A. Thus it sees that the loads and scalar broadcasts are loop invariant. So it pulls them out of the loop. Yes, all 1000 of them.
  3. Those 1000 values need to go somewhere, right? So it spills them onto the stack, also incurring any penalties from stack misalignment.

To put it simply, other compilers do not do this kind of stuff. Or at least they have limits to prevent this from happening. ICX appears to be completely unrestrained.

 

A common theme among ICX misoptimizations is that Loop-Invariant Code Motion (LICM) and Common Subexpression Elimination (CSE) create additional live values that end up spilled to the stack, incurring a penalty that is often larger than the initial savings. The example above is a cherry-picked case where ICX takes this concept to the extreme, resulting in an avalanche of secondary regressions such as misalignment penalties and cache pollution.
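One defensive idiom against this class of misoptimization (a sketch of a common GCC/Clang/ICX technique, not necessarily what y-cruncher itself does) is an empty inline-asm optimization barrier: it makes a value opaque to the optimizer, so LICM/CSE cannot prove it loop-invariant and hoist it into extra live values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Optimization barrier: the empty asm claims to read and write "v", so the
// compiler must treat it as unknown afterwards. (GCC/Clang/ICX syntax.)
template <typename T>
inline void opaque(T& v) {
    asm volatile("" : "+r"(v));
}

// The weights pointer is laundered every iteration, so the compiler cannot
// prove the 1000 loads loop-invariant and hoist/spill them all at once.
double accumulate_weighted(const double* weights, double x) {
    double acc = 0;
    for (std::size_t w = 0; w < 1000; w++) {
        const double* p = weights;
        opaque(p);  // forces the load below to go through an "unknown" pointer
        acc += p[w] * x;
    }
    return acc;
}
```

The cost is that the barrier also blocks legitimate optimizations, so it should be applied only to loops where the hoisting is known to backfire.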

 

 

Conclusion:

 

Intel's LLVM compiler is undoubtedly a very powerful compiler. And the more I study it, the more I am impressed with its ability. However, with power comes responsibility, and unfortunately I cannot say that ICX wields this power well. I have yet to investigate whether these issues are in LLVM itself or in Intel's modifications to it. But regardless, as of today, Intel's LLVM compiler can be best described as a child running with scissors - young and reckless with dangerous tools.

 

How long will it take for ICX to reach ICC's quality of code generation? I have no idea. And after waiting more than a year for this to happen, I've decided that it's probably not going to happen for a very long time. For everything that ICX screws up, it probably gets 5 others right. But for code that has already been hand-optimized, getting it right is neutral while getting it wrong hurts a lot. Dropping down to assembly is not an option because there are "thousands" of distinct kernels which are largely generated via template metaprogramming.
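For a sense of scale, template-generated kernels look roughly like this (a hypothetical illustration, not y-cruncher's actual code): each instantiation is a distinct compiled kernel, so a handful of source patterns multiply into thousands of loops that no one could maintain in hand-written assembly.

```cpp
#include <cassert>
#include <cstddef>

// One source pattern, many kernels: the unroll factor is a template
// parameter, so each instantiation is a separate loop the compiler
// must (correctly) optimize.
template <std::size_t Unroll>
void scale_add(double* A, const double* B, double k, std::size_t length) {
    std::size_t c = 0;
    for (; c + Unroll <= length; c += Unroll) {
        // Fixed trip count: the compiler flattens this inner loop.
        for (std::size_t u = 0; u < Unroll; u++) {
            A[c + u] += k * B[c + u];
        }
    }
    for (; c < length; c++) A[c] += k * B[c];  // tail elements
}

// Each explicit instantiation is a distinct binary kernel:
template void scale_add<2>(double*, const double*, double, std::size_t);
template void scale_add<4>(double*, const double*, double, std::size_t);
template void scale_add<8>(double*, const double*, double, std::size_t);
```

Multiply a pattern like this across unroll factors, instruction sets, and operand types, and the kernel count explodes combinatorially.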

 

Is y-cruncher the only application affected like this? Probably not.

 

 

 

Older News

 

Records Set by y-cruncher:

y-cruncher has been used to set a number of world record sized computations.

 

Blue: Current World Record

Green: Former World Record

Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.

Announced | Completed | Source | Who | Constant | Decimal Digits | Time | Computer
March 14, 2024 | February 27, 2024 | Source | Jordan Ranous, Kevin O’Brien, Brian Beeler (StorageReview) | Pi | 105,000,000,000,000 | Compute: 75 days, Verify: 4 days, Validation File | 2 x AMD Epyc 9754, 1.5 TB, 960 TB storage
February 13, 2024 | February 12, 2024 | | Jordan Ranous | Log(2) | 3,000,000,000,000 | Compute: 42.7 hours, Verify: 58.3 hours | 2 x Intel Xeon Platinum 8460H, 512 GB
January 17, 2024 | January 10, 2023 | | Mamdouh Barakat | Zeta(5) | 250,000,000,000 | Compute: 6.02 days, Not Verified | Intel Xeon Gold 5412U, 125 GB
January 17, 2024 | December 12, 2023 | | Jordan Ranous | Gamma(1/4) | 1,000,000,000,000 | Compute: 22.6 hours, Verify: 22.8 hours | 2 x Intel Xeon Platinum 8450H, 512 GB
December 26, 2023 | December 24, 2023 | | Jordan Ranous | e | 35,000,000,000,000 | Compute: 94.5 hours, Verify: 92.5 hours | 2 x Intel Xeon Platinum 8460H, 512 GB
December 26, 2023 | December 25, 2023 | | Jordan Ranous | Square Root of 2 | 20,000,000,000,000 | Compute: 29.2 hours, Verify: 21.6 hours | Intel Xeon Platinum 8450H, 512 GB / Intel Xeon Platinum 8460H, 512 GB
December 26, 2023 | December 22, 2023 | | Andrew Sun | Zeta(3) - Apery's Constant | 2,020,569,031,595 | Compute: 5.61 days, Verify: 5.93 days | Intel Xeon Platinum 8347C, 505 GB / Intel Xeon Platinum 8347C, 507 GB
December 18, 2023 | December 15, 2023 | | Jordan Ranous | Gamma(1/3) | 1,000,000,000,000 | Compute: 17.5 hours, Verify: 23.3 hours | 2 x Intel Xeon Platinum 8450H, 512 GB
December 18, 2023 | December 11, 2023 | | Jordan Ranous | Zeta(5) | 201,000,001,300 | Compute: 29.9 hours, Verify: 23.5 hours | 2 x AMD EPYC 9754, 1.5 TB
December 2, 2023 | November 27, 2023 | | Jordan Ranous | Golden Ratio | 20,000,000,000,000 | Compute: 76.1 hours, Verify: 30.0 hours | AMD Epyc 9654, 1.5 TB / Intel Xeon Platinum 8450H
September 9, 2023 | September 7, 2023 | | Andrew Sun | Euler-Mascheroni Constant | 1,337,000,000,000 | Compute: 28.5 days, Verify: 41.3 days | Intel Xeon Platinum 8347C, 400 GB
July 17, 2022 | July 15, 2022 | | Seungmin Kim | Lemniscate | 1,200,000,000,100 | Compute: 32.2 days, Verify: 46.5 days | 2 x Intel Xeon Gold 6140, 377 GB
June 8, 2022 | March 21, 2022 | | Emma Haruka Iwao | Pi | 100,000,000,000,000 | Compute: 158 days, Verify: 12.6 hours, Validation File | 128 vCPU Intel Ice Lake (GCP), 864 GB, 663 TB storage
March 14, 2022 | March 9, 2022 | | Seungmin Kim | Catalan's Constant | 1,200,000,000,100 | Compute: 48.6 days, Verify: 47.3 days | 2 x Intel Xeon Gold 6140 / 2 x Intel Xeon E5-2680 v3
August 17, 2021 | August 14, 2021 | Source | UAS Grisons | Pi | 62,831,853,071,796 | Compute: 108 days, Verify: 34.4 hours | AMD Epyc 7542, 1 TB, 34 + 4 Hard Drives
September 13, 2020 | September 6, 2020 | | Seungmin Kim | Log(10) | 1,200,000,000,100 | Compute: 14.5 days, Verify: 22.5 days | 2 x Intel Xeon E5-2699 v3, 756 GB / 2 x Intel Xeon Gold 5220, 754 GB
January 29, 2020 | January 29, 2020 | Blog | Timothy Mullican | Pi | 50,000,000,000,000 | Compute: 303 days, Verify: 17.2 hours, Validation File | 4 x Intel Xeon E7-4880 v2, 315 GB, 48 Hard Drives
March 14, 2019 | January 21, 2019 | Blogs 1 + 2 | Emma Haruka Iwao | Pi | 31,415,926,535,897 | Compute: 121 days, Verify: 20.0 hours, Validation File | 2 x Undisclosed Intel Xeon, > 1.40 TB DDR4, > 240 TB SSD
November 15, 2016 | November 11, 2016 | Blog, Sponsor | Peter Trueb | Pi | 22,459,157,718,361 | Compute: 105 days, Verify: 28 hours, Validation File | 4 x Xeon E7-8890 v3, 1.25 TB DDR4, 20 x 6 TB 7200 RPM Seagate
October 8, 2014 | October 7, 2014 | | Sandon Van Ness (houkouonchi) | Pi | 13,300,000,000,000 | Compute: 208 days, Verify: 182 hours, Validation File | 2 x Xeon E5-4650L, 192 GB DDR3 @ 1333 MHz, 24 x 4 TB + 30 x 3 TB
December 28, 2013 | December 28, 2013 | Source | Shigeru Kondo | Pi | 12,100,000,000,050 | Compute: 94 days, Verify: 46 hours | 2 x Xeon E5-2690, 128 GB DDR3 @ 1600 MHz, 24 x 3 TB

See the complete list including other notably large computations. If you want to set a record yourself, the rules are in that link.

 

 

Features:

 

The main computational features of y-cruncher are:

 

Download:

Sample Screenshot: 1 trillion digits of Pi

Core i7 5960X @ 4.0 GHz - 64 GB DDR4 @ 2400 MHz - 16 HDs

 

Latest Releases: (February 22, 2024)

Downloading any of these files constitutes acceptance of the license agreement.

OS | Download Link | Size
Windows | y-cruncher v0.8.4.9538a.zip | 35.0 MB
Linux (Static) | y-cruncher v0.8.4.9538-static.tar.xz | 26.7 MB
Linux (Dynamic) | y-cruncher v0.8.4.9538-dynamic.tar.xz | 19.0 MB

Downloads can also be found on GitHub. Use this if you prefer HTTPS.

 

The Linux version comes in both statically and dynamically linked versions. The static version should work on most Linux distributions, but lacks TBB and NUMA binding. The dynamic version supports all features, but is less portable due to the DLL dependency hell.

 

HWBOT submission is back with this release. So I expect the leaderboards to be rewritten soon.

 

System Requirements:

Windows:

Linux:

All Systems:

Very old systems that don't meet these requirements may be able to run older versions of y-cruncher. Support goes all the way back to even before Windows XP.

 

Version History:

 

Other Downloads (for C++ programmers):

 

Advanced Documentation:

 

 

Benchmarks:

Comparison Chart: (Last updated: July 11, 2023)

 

Computations of Pi to various sizes. All times in seconds. All computations done entirely in RAM.

The timings include the time needed to convert the digits to decimal representation, but not the time needed to write out the digits to disk.

 

Blue: Benchmarks are up-to-date with the latest version of y-cruncher.

Green: Benchmarks were done with an old version of y-cruncher that is comparable in performance with the current release.

Red: Benchmarks are significantly out-of-date due to being run with an old version of y-cruncher that is no longer comparable with the current release.

Purple: Benchmarks are from unreleased internal builds that are not speed comparable with the current release.

 

 

Laptops + Low-Power:

Processor(s): Core i7 6820HK Core i7 11800H Core i7 11800H
Generation: Intel Skylake Intel Tiger Lake Intel Tiger Lake
Cores/Threads: 4/8 8/16 8/16
Processor Speed: 3.2 GHz (stock) ~2.5 GHz (45W PL) ~3.0 GHz (60W PL)
Memory: 64 GB @ 2133 MT/s 64 GB @ 3200 MT/s 64 GB @ 3200 MT/s
Version: v0.8.1 (14-BDW) v0.8.1 (18-CNL) v0.8.1 (18-CNL)
Instruction Set: x64 AVX2 + ADX x64 AVX512-VBMI x64 AVX512-VBMI
25,000,000 1.500 0.655 0.530
50,000,000 3.307 1.406 1.125
100,000,000 7.238 3.005 2.447
250,000,000 20.596 8.576 6.855
500,000,000 45.967 19.747 15.356
1,000,000,000 102.885 42.727 34.308
2,500,000,000 290.824 123.523 96.918
5,000,000,000 640.506 247.705 218.782
10,000,000,000 1,391.204 526.212 480.197
Credit:      
Processor(s): Core i3 8121U Core i7 11800H
Generation: Intel Cannon Lake Intel Tiger Lake
Cores/Threads: 2/4 8/16
Processor Speed: ~2.5 - 3.2 GHz (stock) ~2.5 - 2.8 GHz (45W PL)
Memory: 8 GB @ 2400 MT/s 64 GB @ 3200 MT/s
Version: v0.8.1 (14-BDW) v0.8.1 (17-SKX) v0.8.1 (18-CNL) v0.8.1 (14-BDW) v0.8.1 (17-SKX) v0.8.1 (18-CNL)
Instruction Set: x64 AVX2 + ADX x64 AVX512-DQ x64 AVX512-VBMI x64 AVX2 + ADX x64 AVX512-DQ x64 AVX512-VBMI
25,000,000 2.857 2.467 1.988 0.907 0.853 0.655
50,000,000 6.446 5.501 4.392 2.075 1.862 1.406
100,000,000 14.335 12.257 9.490 4.176 3.749 3.005
250,000,000 42.566 36.204 27.137 12.014 10.705 8.576
500,000,000 99.040 85.443 64.359 28.805 24.123 19.747
1,000,000,000 228.863 198.405 151.605 63.898 55.264 42.727
2,500,000,000       187.882 148.423 123.523
5,000,000,000       375.130 327.776 247.705
10,000,000,000       794.573 709.606 526.212
Credit:            

 

 

 

Mainstream Desktops:

Processor(s): Ryzen 5 7600 Core i9 11700K Ryzen 9 3950X Ryzen 9 5950X Intel Core i9 13900KS Ryzen 9 7950X
Generation: AMD Zen 4 Intel Rocket Lake AMD Zen 2 AMD Zen 3 Intel Raptor Lake AMD Zen 4
Cores/Threads: 6/12 8/16 16/32 16/32 24/32 16/32
Processor Speed:   stock stock stock 5.7/4.5 GHz stock
Memory: 32 GB 32 GB - 3200 MT/s 128 GB - 2666 MT/s 64 GB - 3200 MT/s 96 GB - 8000 MT/s 128 GB - 4400 MT/s 128 GB - 5200 MT/s
Program Version: v0.8.1 (22-ZN4) v0.8.1 (18-CNL) v0.8.1 (19-ZN2) v0.8.1 (19-ZN2) v0.8.1 (14-BDW) v0.8.1 (22-ZN4)
Instruction Set: x64 AVX512-GFNI x64 AVX512-VBMI x64 AVX2 + ADX x64 AVX2 + ADX x64 AVX2 + ADX x64 AVX512-GFNI
25,000,000 0.439 0.501 0.588 0.490 0.241 0.312 0.307
50,000,000   1.114 1.257 1.090 0.525 0.679 0.654
100,000,000   2.223 2.685 2.345 1.132 1.517 1.410
250,000,000   6.220 7.251 6.371 3.185 4.157 3.820
500,000,000 13.378 13.573 15.556 13.395 7.065 8.883 8.062
1,000,000,000 29.497 30.415 33.925 29.301 15.901 18.542 17.039
2,500,000,000 83.421 86.119 96.695 82.204 44.888 50.743 46.467
5,000,000,000 181.647 193.718 215.333 181.355 99.566 110.379 101.345
10,000,000,000     473.958 399.012   241.162 220.522
25,000,000,000     1,361.732     680.344 623.493
Credit: Joel Rufin Oliver Kruse

 

Oliver Kruse 曾 铮    
Processor(s): Core i7 920 FX-8350 Core i7 4770K Ryzen 7 1800X Ryzen 7 3800X
Generation: Intel Nehalem AMD Piledriver Intel Haswell AMD Zen 1 AMD Zen 2
Cores/Threads: 4/8 8/8 4/8 8/16 8/16
Processor Speed: 3.5 GHz stock 4.0 GHz stock stock
Memory: 12 GB - 1333 MT/s 32 GB - 1600 MT/s 32 GB - 2133 MT/s 64 GB - 2866 MT/s 32 GB - 3600 MT/s
Program Version: v0.8.1 (08-NHM) v0.8.1 (11-BD1) v0.8.1 (13-HSW) v0.8.1 (17-ZN1) v0.8.1 (19-ZN2)
Instruction Set: x64 SSE4.1 x64 FMA4 x64 AVX2 x64 AVX2 + ADX x64 AVX2 + ADX
25,000,000 7.032 3.677 1.546 1.150 0.654
50,000,000 17.174 7.703 3.259 2.527 1.415
100,000,000 36.164 16.576 6.987 5.555 3.028
250,000,000 105.789 46.597 19.588 15.760 8.404
500,000,000 236.096 103.165 43.197 34.659 18.440
1,000,000,000 531.676 230.780 96.845 78.690 41.097
2,500,000,000   669.594 274.336 220.278 117.788
5,000,000,000   1,460.714 606.605 493.388 266.719
10,000,000,000       1,078.187  
25,000,000,000          
Credit:         Oliver Kruse

 

 

 

High-End Desktops:

Processor(s): Core i7 5960X Threadripper 1950X Core i9 7900X Core i9 7940X Threadripper 3990X Xeon W7-2495X Xeon W9-3475X
Generation: Intel Haswell AMD Zen 1 Intel Skylake X Intel Skylake X AMD Zen 2 Intel Sapphire Rapids Intel Sapphire Rapids
Cores/Threads: 8/16 16/32 10/20 14/28 64/128 24/48 36/72
Processor Speed: 4.0 GHz stock ~3.6 GHz (200W PL) 3.6 GHz (AVX512) 2.9 GHz 4.1-4.9 GHz 4.2-4.9 GHz
Memory: 64 GB - 2400 MT/s 64 GB - 2800 MT/s 128 GB - 3000 MT/s 128 GB - 3466 MT/s ~141 GB - 2666 MT/s 64 GB - 6400 MT/s 128 GB - 6400 MT/s
Program Version: v0.8.1 (13-HSW) v0.8.1 (17-ZN1) v0.8.1 (17-SKX) v0.8.1 (17-SKX) v0.8.1 (19-ZN2) v0.8.1 (18-CNL) v0.8.3 (18-CNL)
Instruction Set: x64 AVX2 x64 AVX2 + ADX x64 AVX512-DQ x64 AVX512-DQ x64 AVX2 + ADX x64 AVX512-VBMI x64 AVX512-VBMI
25,000,000 0.807 0.756 0.522 0.404 0.584 0.170 0.201
50,000,000 1.743 1.579 1.028 0.721 1.181 0.340 0.321
100,000,000 3.647 3.273 2.048 1.451 2.409 0.726 0.586
250,000,000 10.088 8.990 5.752 4.056 5.724 2.068 1.413
500,000,000 22.075 19.604 12.830 9.017 10.881 4.588 2.627
1,000,000,000 49.232 43.014 28.906 20.518 21.496 10.190 5.924
2,500,000,000 139.404 121.645 82.764 60.636 58.009 28.881 16.345
5,000,000,000 311.388 271.983 186.233 137.906 126.513 64.158 36.139
10,000,000,000 669.736 613.450 401.820 302.121 274.050 124.826 78.816
25,000,000,000     1,125.775 843.498 768.212   225.482
Credit:   Oliver Kruse     Paul Underwood 曾 铮

 

 

Multi-Processor Workstation/Servers:

 

Due to high core counts and the effect of NUMA (Non-Uniform Memory Access), performance on multi-processor systems is extremely sensitive to various settings. Therefore, these benchmarks may not be entirely representative of what the hardware is capable of.

Processor(s):

Xeon Platinum 8375C

(AWS x2iedn.32xlarge)

Xeon Platinum 8488C

(AWS m7i.48xlarge)

Epyc 9R14

(AWS m7a.48xlarge)

Epyc 9R14

(AWS hpc7a.96xlarge)

Epyc 9754
Generation: Intel Sapphire Rapids Intel Sapphire Rapids AMD Genoa AMD Bergamo
Cores/Threads: 64/128 96/192 192/192 128/256 128/128
Processor Speed: 2.9 GHz 2.4 GHz 2.6 GHz 2.25 - 3.1 GHz
Memory: 4 TB 744 GB 740 GB 768 GB - 4800 MT/s
Program Version: v0.8.1 (18-CNL) v0.8.1 (18-CNL) v0.8.1 (22-ZN4) v0.8.1 (22-ZN4)
Instruction Set: x64 AVX512-VBMI x64 AVX512-VBMI x64 AVX512-GFNI x64 AVX512-GFNI
25,000,000 0.250 0.163 0.216 0.213 0.245 0.229
50,000,000 0.454 0.289 0.285 0.279 0.350 0.433
100,000,000 0.844 0.531 0.642 0.635 0.853 0.876
250,000,000 1.976 1.288 1.776 1.716 2.224 2.133
500,000,000 3.794 2.499 3.728 3.621 4.186 3.850
1,000,000,000 7.650 5.149 6.547 6.265 7.063 6.495
2,500,000,000 20.425 13.633 13.554 12.500 15.338 14.477
5,000,000,000 45.675 29.655 25.334 22.377 29.072 28.133
10,000,000,000 101.468 64.026 51.134 44.059 58.797 59.007
25,000,000,000 297.622 182.920 140.286 120.282 156.797 164.281
50,000,000,000 678.016 410.842 321.970 275.297 350.391 368.548
100,000,000,000 1,549.991 943.182 771.266 672.558 829.957 853.717
250,000,000,000 4,488.317          
500,000,000,000 9,685.971          
Credit: Greg Hogan Tim Wesley

 

Processor(s): Xeon Platinum 8124M Xeon Gold 6148 Xeon Platinum 8175M Xeon Platinum 8275CL Epyc 7742 Epyc 7B12 Epyc 7742
Generation: Intel Skylake Purley Intel Skylake Purley Intel Skylake Purley Intel Cascade Lake AMD Rome AMD Rome AMD Rome
Sockets/Cores/Threads: 2/36/72 2/40/40 2/48/96 2/48/96 2/128/256 2/112/224 2/128/256
Processor Speed: 3.0 GHz 2.4 GHz 2.5 GHz 3.0 GHz   2.25 GHz 2.25 GHz
Memory: 137 GB - ?? 188 GB - ?? ~756 GB - ?? 192 GB ~504 GB ~882 GB 2 TB
Program Version: v0.7.5 (17-SKX) v0.7.6 (17-SKX) v0.7.6 (17-SKX) v0.7.8 (17-SKX) v0.7.7 (17-ZN1) v0.7.8 (19-ZN2) v0.7.8 (19-ZN2)
Instruction Set: x64 AVX512-DQ x64 AVX512-DQ x64 AVX512-DQ x64 AVX512-DQ x64 AVX2 + ADX x64 AVX2 + ADX x64 AVX2 + ADX
25,000,000 0.540 0.329 0.294 0.283 0.534 0.439 0.513
50,000,000 0.981 0.683 0.617 0.544 1.027 0.838 0.920
100,000,000 1.905 1.456 1.305 1.169 2.298 1.796 1.887
250,000,000 5.085 3.737 3.591 3.125 5.854 4.509 4.650
500,000,000 10.372 7.750 7.293 6.309 10.502 8.196 8.066
1,000,000,000 21.217 16.550 15.041 13.042 17.836 14.252 13.246
2,500,000,000 55.701 45.693 39.329 34.028 35.485 30.592 27.011
5,000,000,000 118.151 99.078 83.601 71.777 62.432 58.405 49.940
10,000,000,000 247.928 212.984 176.695 153.169 115.543 116.900 98.156
25,000,000,000   599.653 491.988 425.442 307.995 314.907 258.081
50,000,000,000     1,081.181   690.662 741.633 598.716
100,000,000,000           1715.123 1,370.714
250,000,000,000             3,872.397
Credit: Jacob Coleman Oliver Kruse newalex Xinyu Miao Carsten Spille Greg Hogan Song Pengei
Processor(s): Xeon E5-2683 v3 Xeon E7-8880 v3 Xeon E5-2687W v4 Xeon E5-2686 v4 Xeon E5-2696 v4 Epyc 7601 Xeon Gold 6130F
Generation: Intel Haswell Intel Haswell Intel Broadwell Intel Broadwell Intel Broadwell AMD Naples Intel Skylake Purley
Sockets/Cores/Threads: 2/28/56 4/64/128 2/24/48 2/36/72 2/44/88 2/64/128 2/32/64
Processor Speed: 2.03 GHz 2.3 GHz 3.0 GHz 2.3 GHz 2.2 GHz 2.2 GHz 2.1 GHz
Memory: 128 GB - ??? 2 TB - ??? 64 GB 504 GB - ??? 768 GB - ??? 256 GB - ?? 256 GB - ??
Program Version: v0.6.9 (13-HSW) v0.7.1 (13-HSW) v0.7.6 (14-BDW) v0.7.7 (14-BDW) v0.7.1 (14-BDW) v0.7.3 (17-ZN1) v0.7.3 (17-SKX)
Instruction Set: x64 AVX2 x64 AVX2 x64 AVX2 + ADX x64 AVX2 + ADX x64 AVX2 + ADX x64 AVX2 + ADX x64 AVX512-DQ
25,000,000 0.907 1.176 0.490 0.494 0.715 2.459 1.150
50,000,000 1.745 2.321 1.072 0.982 1.344 4.347 1.883
100,000,000 3.317 4.217 2.303 2.193 2.673 6.996 3.341
250,000,000 8.339 8.781 6.196 6.044 6.853 14.258 7.731
500,000,000 17.708 15.879 13.046 12.582 14.538 24.930 15.346
1,000,000,000 37.311 32.078 27.763 26.852 31.260 47.837 31.301
2,500,000,000 102.131 78.251 76.202 73.596 84.271 111.139 82.871
5,000,000,000 218.917 164.157 165.046 160.094 192.889 228.252 179.488
10,000,000,000 471.802 346.307 356.487 346.305 417.322 482.777 387.530
25,000,000,000 1,511.852 957.966 1,006.131 980.784 1,186.881 1,184.144 1,063.850
50,000,000,000   2,096.169 2,202.558 2,156.854 2,601.476    
100,000,000,000   4,442.742     6,037.704    
250,000,000,000   17,428.450          
Credit: Shigeru Kondo Jacob Coleman Cameron Giesbrecht newalex "yoyo" Dave Graham

 

 

Fastest Times:

The full chart of rankings for each size can be found here:

These fastest times may include unreleased betas.


Got a faster time? Let me know: a-yee@u.northwestern.edu

Note that I usually do not respond to these emails. I simply put them into the charts which I update periodically (typically within 2 weeks).

 

 

Performance Tips:

 

Decimal Digits of Pi - Times in Seconds

Core i9 7940X @ 3.7 GHz AVX512

Memory Frequency: 2666 MT/s 3466 MT/s
25,000,000 0.839 0.758
50,000,000 1.424 1.338
100,000,000 2.701 2.425
250,000,000 6.489 5.877
500,000,000 13.307 11.917
1,000,000,000 27.913 24.915
2,500,000,000 76.837 68.322
5,000,000,000 168.058 148.737
10,000,000,000 365.047 322.115
25,000,000,000 1,037.527 916.039

High core count Skylake X processors are known to be heavily bottlenecked by memory bandwidth.

Memory Bandwidth:

 

Because of the memory-intensive nature of computing Pi and other constants, y-cruncher needs a lot of memory bandwidth to perform well. In fact, the program has been noticeably memory bound on nearly all high-end desktops since 2012 as well as the majority of multi-socket systems since at least 2006.

 

Recommendations:

Don't be surprised if y-cruncher exposes instabilities that other applications and stress-tests do not. y-cruncher is unusual in that it simultaneously places a heavy load on both the CPU and the entire memory subsystem.

 

 

 

Parallel Performance:

 

y-cruncher has a lot of settings for tuning parallel performance. By default, it makes a best effort to analyze the hardware and pick good settings. But because of the virtually unlimited combinations of processor topologies, it's difficult for y-cruncher to choose optimally for everything. So sometimes the best performance can only be achieved with manual settings.

*These are advanced settings that cannot be changed if you're using the benchmark option in the console UI. To change them, you will need to either run benchmark mode from the command line or use the custom compute menu.

 

Load imbalance is a fairly common problem in y-cruncher. The usual causes are:

  1. The number of logical cores is not a power-of-two.
  2. The cores are not homogeneous. Common reasons include:
    • The cores are clocked at different speeds.
    • The cores have access to different amounts of memory bandwidth due to an imbalanced NUMA topology.
    • The cores are different generation cores hidden behind a virtual machine.
  3. CPU-intensive background processes are interfering with y-cruncher's ability to use all the hardware. This applies to all forms of system jitter.

 

 

Large Pages:

 

Large pages didn't matter in the past, but they do now in the post-Spectre/Meltdown world. Mitigations for the Meltdown vulnerability can cause a noticeable performance drop for y-cruncher (up to 5% has been observed). It turns out that turning on large pages can mitigate the penalty for this mitigation. (pun intended)

 

Refer to the memory allocation guide on how to turn on large pages.
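For reference, here is a minimal sketch of requesting huge pages on Linux (illustrative only - the memory allocation guide covers y-cruncher's actual settings, and on Windows the equivalent mechanism is VirtualAlloc with MEM_LARGE_PAGES):

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Try to back a buffer with huge pages, falling back to normal 4 KB pages
// if none are reserved on the system (e.g. vm.nr_hugepages is 0).
void* alloc_maybe_huge(std::size_t bytes) {
    void* p = MAP_FAILED;
#ifdef MAP_HUGETLB
    p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#endif
    if (p == MAP_FAILED) {
        // Huge pages unavailable: fall back to regular pages.
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? nullptr : p;
}
```

With large pages, each TLB entry covers far more of the working set, which is what blunts the cost of the Meltdown page-table mitigations.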

 

 

Swap Mode:

 

This is probably one of the most complicated features in y-cruncher.

 

 

Known Issues:

 

Everything in this section is in the process of being re-verified and moved to: https://github.com/Mysticial/y-cruncher/issues

 

 

Performance Issues:


Algorithms and Developments:

 

FAQ:

 

Pi and other Constants:

 

Program Usage:

 

Hardware and Overclocking:

 

Academia:

 

Programming:

 

Other:

 

Links:

Here are some interesting sites dedicated to the computation of Pi and other constants:

 

Questions or Comments

Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.

You can also find me on Twitter as @Mysticial.