y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

(Last updated: April 26, 2015)

 

Shortcuts:

 

The first scalable multi-threaded Pi-benchmark for multi-core systems...

 

How fast can your computer compute Pi?

 

y-cruncher is a program that can compute Pi and other constants to trillions of digits.

It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.

 

y-cruncher has been used to set several world records for the most digits of Pi ever computed.

 

Current Release:

Windows: Version 0.6.8 Build 9460 (Released: March 17, 2015)

Linux      : Version 0.6.8 Build 9460 (Released: March 17, 2015)

 

Official Xtremesystems Forums thread.

 

News:

 

Version v0.6.8 Patched: (March 17, 2015)

 

I totally knew this would happen... The moment I rush a release (for Pi Day), something breaks.

 

It turns out there was a very large (up to 5%) performance regression on Haswell processors that scales inversely with the memory bandwidth of the system. Normally, such large regressions are caught long before they can be released. But since my primary test machine has 4 channels of overclocked DDR4, I never noticed the regression at all. It was only after Shigeru Kondo reported this did I test it on a different Haswell machine which did reproduce the regression.

 

This screw-up involved a cache optimization that was designed into the algorithm from the very beginning rather than being added later as a result of profiling. Being a premature optimization, it backfired for small computations and had no noticeable effect for large ones on my machine. Therefore I disabled it sometime between v0.6.7 and v0.6.8. Not a big deal, I expect some of these premature micro-optimizations to backfire.

 

Well... It turns out that the premature optimization wasn't really that premature after all. On large computations, it reduces the demand on memory bandwidth. If the processor is bandwidth-starved (such as mainstream Haswell), it translates to an increase in performance. Therefore, this patch re-enables the optimization on all processors except for AMD Bulldozer - which takes a 10% performance hit for some unknown reason.

 

The sad part is that, back in 2012, I did this optimization because I predicted that it would reduce the demand on memory bandwidth. But it wasn't until 2015 did the hardware and software become fast enough for it to actually make a difference. During those 3 years, I completely forgot about the optimization and left it in the "on" position until a recent refactoring touched it. That prompted me to re-evaluate it and (erroneously) disable it for the Pi day v0.6.8 release.

 

 

Pi Day of the Century: (March 14, 2015)

 

Surprise! There was no way I could possibly pass this day up right?

 

y-cruncher v0.6.8 is out with some new features:

 

Older News

 

Records Set by y-cruncher:

y-cruncher has been used to set a number world record size computations.

 

Blue: Current World Record

Green: Former World Record

Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.

Date Announced Date Completed: Source: Who: Constant: Decimal Digits: Time: Computer:
April 26, 2015 April 25, 2015   Matthew Hebert e 1,400,000,000,000

Compute:  15 days

Not Verified

FX-8370 @ 4.0 GHz - 8 GB
April 1, 2015 April 12, 2015   BenHadad Lemniscate 55,000,000,000

Compute:  40.4 hours

Verify:  6.99 days

2 x Xeon E5-2670 v2 @ 2.5 GHz
(32 vCPU only) - 244 GB
Intel Core i7 4790K - 8 GB
October 8, 2014 October 7, 2014   "houkouonchi" Pi 13,300,000,000,000

Compute:  208 days

Verify:  182 hours

 

Validation File

2 x Xeon E5-4650L @ 2.6 GHz
192 GB DDR3 @ 1333 MHz
24 x 4 TB + 30 x 3 TB
March 24, 2014 March 10, 2014   Shigeru Kondo Log(10) 200,000,000,050

Compute:  44.4 hours

Verify:  49.7 hours

 

Validations: 1, 2

2 x Xeon E5-2690 @ 3.3 GHz
256 GB DDR3 @ 1600 MHz
12 x 3 TB
February 28, 2014   Shigeru Kondo Log(2) 200,000,000,050

Compute:  55.8 hours

Verify:  56.5 hours

 

Validations: 1, 2

2 x Xeon E5-2690 @ 3.3 GHz
256 GB DDR3 @ 1600 MHz
12 x 3 TB
December 28, 2013 December 28, 2013 Source Shigeru Kondo Pi 12,100,000,000,050

Compute: 94 days

Verify: 46 hours

2 x Xeon E5-2690 @ 2.9 GHz
128 GB DDR3 @ 1600 MHz
24 x 3 TB
December 22, 2013 December 22, 2013   Alexander Yee Euler-Mascheroni Constant 119,377,958,182

Compute:  50 days

Verify:  38 days

 

Validations: 1, 2

"Nagisa"
2 x Intel Xeon X5482 @ 3.2 GHz
64 GB DDR2 FB-DIMM
64 GB SSD (Boot) + 2 TB (Data)
8 x 2 TB (Computation)
September 13, 2013 September 13, 2013 Source Setti Financial LLC Zeta(3) - Apery's Constant 200,000,001,000

Compute:  ~5 months

Not Verified

Intel Core i5-3570S @ 3.1 GHz
16 GB
April 8, 2013 April 8, 2013 Source Setti Financial LLC Catalan's Constant 100,000,000,000

Compute:  ~4 months

Not Verified

2 x Intel Xeon X5460 @ 3.16 GHz
16 GB DDR2
February 9, 2012 February 9, 2012   Alexander Yee Square Root of 2 2,000,000,000,050

Compute:  110 hours
Verify:  119 hours

2 x Xeon X5482 @ 3.2 GHz - 64 GB
8 x 2 TB
Core i7 2600K @ 4.4 GHz - 16 GB
5 x 1 TB + 5 x 2 TB
September 17, 2010 September 17, 2010 Source Alexander Yee Zeta(3) - Apery's Constant 100,000,001,000

Compute:  148 hours

Verify:  366 hours

"Nagisa" + "Ushio"
July 8, 2010 July 8, 2010 Source Alexander Yee Golden Ratio 1,000,000,000,000

Compute:  114 hours

Verify:  ~7 days*

*Not a continuous run.

"Nagisa"
2 x Intel Xeon X5482 @ 3.2 GHz
64 GB DDR2 FB-DIMM
1.5 TB (Boot + Output)
4 x 1 TB (2 x 2 RAID0) + 6 x 2 TB
July 5, 2010 July 5, 2010 Source Shigeru Kondo e 1,000,000,000,000

Compute: 224 hours

Verify: 219 hours

Intel Core i7 980X @ 3.33 GHz
12 GB DDR3
2 TB (Boot + Output)
8 x 1 TB (Computation)
April 16, 2009 April 16, 2009 Source Alexander Yee &
Raymond Chan
Catalan's Constant 31,026,000,000

Compute:  178 hours

Verify:  221 hours

"Nagisa"

See the complete list including other notably large computations.

 

Features:

Aside from computing Pi and other constants, y-cruncher is great for stress testing 64-bit systems with lots of ram.

 

 

Download:

Sample Screenshot: 100 billion digits of Pi

Core i7 4770K @ 4.0 GHz - 32GB DDR3 @1866 MHz - 16 HDs

 

Latest Release: (March 17, 2015)

Windows: y-cruncher v0.6.8.9460.zip (9.92 MB)
Linux      : y-cruncher v0.6.8.9460.tar.gz (8.43 MB)

 

System Requirements:

Windows:

Linux:

All Systems:

 

Version History:

Main Page: y-cruncher - Version History

 

Other Downloads (for C++ programmers):

 

Advanced Documentation:

 

 

 

 

 

Known Issues:

 

Functionality Issues:

 

Performance Issues:

 

Benchmarks:

Comparison Chart: (Last updated: February 24, 2015)

 

Computations of Pi to various sizes. All times in seconds. All times include the time needed to convert the digits to decimal representation.

 

Single-Processor Desktops:

Processor(s): Core 2 Quad Q6600 Core i7 920 Core i7 3630QM FX-8350 Core i7 4770K Core i7 5960X
Generation: Intel Core Intel Nehalem Intel Ivy Bridge AMD Piledriver Intel Haswell Intel Haswell
Cores/Threads: 4/4 4/8 4/8 8/8 4/8 8/16
Processor Speed: 2.4 GHz 3.5 GHz (OC) 2.4 GHz (3.2 GHz turbo) 4.0 GHz (4.2 GHz turbo) 4.0 GHz (OC) 4.0 GHz (OC)
Memory: 6 GB - 800 MHz 12 GB - 1333 MHz 8 GB - 1600 MHz 16 GB - 1333 MHz 32 GB - 1866 MHz 64 GB - 2666 MHz
Version: v0.6.3 - SSE3 v0.6.3 - SSE4.1 v0.6.7 - AVX v0.6.7 - XOP v0.6.7 - AVX2 v0.6.7 - AVX2
25,000,000 12.925 6.852 5.302 6.188 2.180 1.502
50,000,000 27.713 14.272 11.239 11.629 4.733 2.929
100,000,000 59.752 30.910 24.130 23.839 10.206 5.822
250,000,000 171.932 86.899 69.358 63.987 28.675 15.593
500,000,000 388.090 191.235 154.446 142.572 63.602 34.570
1,000,000,000 862.522 429.040 358.208 307.381 139.011 74.362
2,500,000,000       869.671 398.734 212.347
5,000,000,000         863.474 450.680
10,000,000,000           976.559

 

Multi-Processor Workstation/Servers:

Processor(s): 2 x Xeon X5482 2 x Xeon E5-2690*
Generation: Intel Penryn Intel Sandy Bridge
Cores/Threads: 8/8 16/32
Processor Speed: 3.2 GHz 3.5 GHz
Memory: 64 GB - 800 MHz 256 GB - ???
Version: v0.6.3 - SSE4.1 v0.6.2/3 - AVX
25,000,000 6.923 2.283
50,000,000 14.386 4.295
100,000,000 28.242 8.167
250,000,000 76.197 20.765
500,000,000 157.537 42.394
1,000,000,000 346.963 89.920
2,500,000,000 964.038 239.154
5,000,000,000 2123.981 520.977
10,000,000,000 4633.681 1131.809
25,000,000,000   3341.281
50,000,000,000   7355.076

*Credit to Shigeru Kondo.

 

 

Fastest Times:

The full chart of rankings for each size can be found here:

These fastest times may include unreleased betas.


Got a faster time? Let me know: a-yee@u.northwestern.edu

Note that I usually don't respond to these emails. I simply put them into the charts which I update periodically.


Algorithms and Developments:

 

FAQ:

Q:  Is there a version that can use the GPU?

A:  This is still a no-go for current generation GPUs. But things may get more interesting with Xeon Phi.

  1. As of 2015, most GPUs are optimized for single-precision performance. Their double-precision and 64-bit integer throughput is far from impressive. (with notable exceptions being the Nvidia Tesla and Titan Black cards)

    The problem is that every single large integer multiplication algorithm uses either double-precision, 64-bit integer multiply, or carry-propagation. All of these operations are inefficient on current GPUs. And no, single-precision cannot be used because it imposes size limits that make the algorithms useless.

  2. GPUs require massive vectorization. Large number arithmetic is difficult to vectorize due to carry-propagation. The speedups that are currently achieved with CPU vectorization (SSE, AVX) are only modest at best.

  3. Large computations of Pi and other constants are not limited by computing power. The bottleneck is in the data communication. (memory bandwidth, disk I/O, etc...) So throwing GPUs at the problem (even if they could be utilized) would not help much.

Fundamental issues aside, the biggest practical barrier would be the need to rewrite the entire program using GPU programming paradigms.

 

It is worth mentioning the Xeon Phi co-processor line. Programming for these do not require a change of programming paradigm. Furthermore, the ISA convergence to AVX512 between Skylake and Knights Landing makes Xeon Phi even more attractive.

 

In the end, GPUs are still limited by PCIe bandwidth. Furthermore they do nothing to solve the disk I/O bottleneck. So while they may be interesting for small benchmarks, they won't offer much in terms breaking world records.

 

 

Q:  Why can't you use distributed computing to set records?

A:  No for more or less the same reasons that GPUs aren't useful.

  1. Just as with GPUs, computational power is not the bottleneck. It is the data communication. For this to be feasible as of 2015, everyone would need to have an internet connection speed of more than 1 GB/s. Anything slower than that and it's faster to do it on a single computer.

  2. Computing a lot of digits requires a lot of memory which would need to be distributed among all the participants. But there is no tolerance for data loss and distributed computing means that participants can freely join or leave the network at any time. Therefore, a tremendous amount of redundancy will be needed to ensure that no data is lost when participants leave.

 

Q:  Is there a distributed version that performs better on NUMA and HPC clusters?

A:  Not specifically. y-cruncher is still a shared memory program, so it inherently will not scale well into large networks.

 

As far as tweaks go, y-cruncher is known to be more sensitive to memory bandwidth than latency. So some performance can be gained by interleaving memory so that the bandwidth from all nodes are utilized. On Linux this can be done using: numactl --interleave=all "./y-cruncher.out"

 

 

Q:  Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?

A:  This was something I tried back in 2012, but it didn't work out. The problem is that the interface changes far too quickly for it to be maintainable.

 

That said it is still possible to make this happen. The easy way is to fork out a static version of the library. But this means that it will never get updated with future optimizations. The harder approach is to build a compatibility layer to a static interface. But this will be increasingly difficult to do efficiently as the static interface falls further and further behind the internal interface. In some cases, it may not even be possible when a newer version completely removes an old feature. (This will happen in v0.6.8 when the concept of "threads" will be replaced with "task decomposition".)

 

 

Q:  What's the deal with the privilege elevation? Why does y-cruncher need administrator privileges in Windows?

A:  Privilege elevation is needed to work-around a security feature that would otherwise hurt performance.

 

In Swap Mode, y-cruncher creates large files and writes to them non-sequentially. When you create a new file and write to offset X, the OS will zero the file from the start to X. This zeroing is done for security reasons to prevent the program from reading data that has been leftover from files that have been deleted.

 

The problem is that this zeroing incurs a huge performance hit - especially when these swap files could be terabytes large. The only way to avoid this zeroing is to use the SetFileValidData() function which requires privilege elevation.

 

Linux doesn't have this problem since it implicitly uses sparse files.

 

 

Q:  Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!

A:  For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done.

 

As an additional note, y-cruncher has never been particularly efficient with the way it uses threads. Historically it always spawns new threads for tasks and then kills them when the task is done. While this works fine for large computations, it has a lot of overhead for small ones. This is partially being addressed in v0.6.8 with a revamp of the threading framework.

 

 

Q:  Is y-cruncher open-sourced?

A:  In short no. But roughly 5% of the code (mostly involving the Digit Viewer) is open sourced.

 

Links:

Here's some interesting sites dedicated to the computation of Pi and other constants:

 

Questions or Comments

Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.