y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

(Last updated: April 28, 2016)

 

Shortcuts:

 

The first scalable multi-threaded Pi-benchmark for multi-core systems...

 

How fast can your computer compute Pi?

 

y-cruncher is a program that can compute Pi and other constants to trillions of digits.

It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.

 

y-cruncher has been used to set several world records for the most digits of Pi ever computed.

 

Current Release:

Windows: Version 0.6.9 Build 9464 (Released: March 1, 2016)

Linux      : Version 0.6.9 Build 9464 (Released: March 1, 2016)

 

Official HWBOT thread.

Official Xtremesystems Forums thread.

 

News:

 

GUI Benchmark Wrapper and HWBOT Integration: (April 3, 2016)

 

I get asked these two questions a lot:

  1. Why don't you add a GUI for y-cruncher?
  2. Why isn't y-cruncher on HWBOT?

#1 never happened because I suck at UI programming and I didn't want that mixed in with performance critical code.

#2 never happened because the HWBOT benchmark API wasn't ready.

 

Well, both finally happening... More details here: http://forum.hwbot.org/showthread.php?t=155079

 

 

Pi Day and some Spin-off Projects: (March 14, 2016)

 

Anyone who has been following my GitHub profile for past year will know that I've been working a library that exposes the compute-engine of y-cruncher. Well that's finally done and pushed out the door. (It was actually completed in January, but I waited until now following my usual "wait several months for Q/A".)

 

In any case, the spin-off project consists of two components:

YMP stands for "y-cruncher Multi-Precision Library". For the most part, it's just another bignum library - except that it supports SIMD and parallelized large multiplication.

Number Factory is largely a test app for the YMP library. It implements much of the same functionality as y-cruncher, albeit more cleanly and less efficiently.

 

The two can be found on my GitHub: https://github.com/Mysticial/NumberFactory

Documentation for the library can be found here: http://www.numberworld.org/ymp/v1.0/

 

For now, the project is entirely experimental and is available only for 64-bit Windows with Visual Studio 2015. It is far from mature and there are no plans to support Linux in the near future. But at the very least, it will let people code things up that utilize y-cruncher's parallel large multiplication.

 

 

Older News

 

Records Set by y-cruncher:

y-cruncher has been used to set a number world record size computations.

 

Blue: Current World Record

Green: Former World Record

Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.

Date Announced Date Completed: Source: Who: Constant: Decimal Digits: Time: Computer:
April 28, 2016 April 28, 2016   "yoyo" Golden Ratio 2,500,000,000,000

Compute:  42.2 hours

Not Verified

2 x Intel Xeon E5-2696 v4 @ 2.2 GHz
768 GB
April 24, 2016 April 18, 2016   Ron Watkins Log(2) 500,000,000,000

Compute:  12.8 days

Verify:  14.4 days

4 x Xeon X5690 @ 3.47 GHz - 141 GB
April 17, 2016 April 12, 2016   Ron Watkins Catalan's Constant 250,000,000,000

Compute:  204 hours

Verify:  207 hours

4 x Xeon E5-4660 v3 @ 2.1 GHz
1 TB
April 9, 2015 April 3, 2015   Ron Watkins Lemniscate 200,000,000,000

Compute:  61.6 hours

Verify:  73.7 hours

4 x Xeon X6550 @ 2 GHz - 492 GB
April 9, 2016 April 3, 2016   Ron Watkins Log(10) 500,000,000,000

Compute:  14.4 days

Verify:  15.2 days

2 x Xeon X5690 @ 3.47 GHz
141 GB
April 3, 2016 April 2, 2016   Ron Watkins Square Root of 2 5,000,000,000,000

Compute:  19.6 days

Verify:  20.8 days

4 x Xeon X7350 @ 2.93 GHz
96 GB
March 16, 2016 March 15, 2016   Peter Trueb Euler-Mascheroni Constant 160,000,000,000

Compute:  60.6 hours

Verify:  104 hours

4 x Xeon E7-8890 v3 @ 2.5 GHz
1.25 TB
February 17, 2016 February 14, 2016   Ron Watkins e 1,500,000,000,000

Compute:  116 hours

Verify:  116 hours

2 x Xeon X5690 @ 3.47 GHz
141 GB
February 8, 2016 February 6, 2016   Mike A Catalan's Constant 500,000,000,000

Compute:  26.1 days

Not Verified

2 x Intel Xeon E5-2697 v3 @ 2.6 GHz
128 GB
December 21, 2015 December 21, 2015   Dipanjan Nag Zeta(3) - Apery's Constant 400,000,000,000

Compute:  22 days

Verify:  24 days

Xeon E5-2698B @ 2.0 GHz - 224 GB
July 24, 2015 July 22, 2015
July 23, 2015
Source Ron Watkins
Dustin Kirkland
Golden Ratio 2,000,000,000,000

Compute:  77.3 hours

Verify:  76.33 hours

Compute:  79.3 hours

Verify:  80.8 hours

4 x Xeon X6550 @ 2 GHz - 512 GB
Xeon E5-2676 v3 @ 2.4 GHz - 64 GB
October 8, 2014 October 7, 2014   "houkouonchi" Pi 13,300,000,000,000

Compute:  208 days

Verify:  182 hours

 

Validation File

2 x Xeon E5-4650L @ 2.6 GHz
192 GB DDR3 @ 1333 MHz
24 x 4 TB + 30 x 3 TB
December 28, 2013 December 28, 2013 Source Shigeru Kondo Pi 12,100,000,000,050

Compute: 94 days

Verify: 46 hours

2 x Xeon E5-2690 @ 2.9 GHz
128 GB DDR3 @ 1600 MHz
24 x 3 TB

See the complete list including other notably large computations.

 

Note that for anyone attempting to set a Pi world record: Should the attempt succeed, I kindly ask that you make yourself sufficiently available for external requests to access or download the digits in its entirety (at least until it is broken again by someone else).

 

Pi is popular enough that people do actually want to see the digits. Everything else is less important. Though it's worth keeping the digits for large computations of the slower constants as well since they are much more difficult to compute.

 

Features:

Aside from computing Pi and other constants, y-cruncher is great for stress testing 64-bit systems with lots of ram.

 

 

Download:

Sample Screenshot: 100 billion digits of Pi

Core i7 4770K @ 4.0 GHz - 32GB DDR3 @1866 MHz - 16 HDs

 

Latest Release: (March 1, 2016)

Windows

y-cruncher v0.6.9.9464.zip

10.6 MB

Linux (Static)

y-cruncher v0.6.9.9464-static.tar.gz

11.2 MB

Linux (Dynamic)

y-cruncher v0.6.9.9464-dynamic.tar.gz

7.66 MB

 

 

 

 

 

 

The Linux version comes in both statically and dynamically linked versions.

 

The static version should work on most Linux distributions, but lacks Cilk Plus. The dynamic version supports Cilk Plus, but is less portable due to the DLL dependency hell.

 

System Requirements:

Windows:

Linux:

All Systems:

Very old systems that don't meet these requirements may be able to run older versions of y-cruncher. Support goes all the way back to Windows XP and x86 without SSE.

 

Version History:

Main Page: y-cruncher - Version History

 

Other Downloads (for C++ programmers):

 

Advanced Documentation:

 

 

 

 

 

Known Issues:

 

Functionality Issues:

 

Performance Issues:

So while it may be difficult to believe, Windows is currently the more suitable OS for running y-cruncher.

 

Benchmarks:

Comparison Chart: (Last updated: March 3, 2016)

 

Computations of Pi to various sizes. All times in seconds.

The timings include the time needed to convert the digits to decimal representation, but not the time needed to write out the digits to disk.

 

 

Laptops:

Processor(s): Core i7 3630QM Core i7 6820HK
Generation: Intel Ivy Bridge Intel Skylake
Cores/Threads: 4/8 4/8
Processor Speed: 2.4 GHz (3.2 GHz turbo) 3.2 GHz
Memory: 8 GB - 1600 MHz 48 GB - 2133 MHz
Version: v0.6.9 - AVX v0.6.9 - AVX2
25,000,000 5.074 1.906
50,000,000 10.855 4.268
100,000,000 23.656 9.274
250,000,000 68.661 26.444
500,000,000 151.196 59.178
1,000,000,000 350.849 130.827
2,500,000,000   371.300
5,000,000,000   814.674
10,000,000,000   1782.586

 

Single-Processor Desktops:

Processor(s): Core 2 Quad Q6600 Core i7 920 FX-8350 Core i7 4770K Core i7 5960X Core i7 6700K*
Generation: Intel Core Intel Nehalem AMD Piledriver Intel Haswell Intel Haswell Intel Skylake
Cores/Threads: 4/4 4/8 8/8 4/8 8/16 4/8
Processor Speed: 2.4 GHz 3.5 GHz (OC) 4.0 GHz (4.2 GHz turbo) 4.0 GHz (OC) 4.0 GHz (OC) 4.6 GHz (OC)
Memory: 6 GB - 800 MHz 12 GB - 1333 MHz 16 GB - 1333 MHz 32 GB - 1866 MHz 64 GB - 2400 MHz 64 GB - 2800 MHz
Version: v0.6.9 - SSE3 v0.6.9 - SSE4.1 v0.6.9 - XOP v0.6.9 - AVX2 v0.6.9 - AVX2 v0.6.9 - AVX2
25,000,000 13.514 6.691 4.223 1.832 1.018 1.360
50,000,000 27.906 13.796 9.471 4.109 2.347 3.117
100,000,000 59.756 31.136 20.189 8.970 4.978 6.756
250,000,000 172.722 85.637 57.753 25.790 13.695 19.857
500,000,000 388.863 189.099 127.181 58.350 30.276 45.579
1,000,000,000 879.214 424.928 278.857 128.386 66.807 98.891
2,500,000,000     804.619 366.066 189.518 284.260
5,000,000,000       800.977 416.514 609.001
10,000,000,000         919.385 1328.204

*Credit to Oliver Kruse.

 

 

Multi-Processor Workstation/Servers:

Processor(s): 2 x Xeon X5482 2 x Xeon E5-2690* 2 x Xeon E5-2683 v3* 2 x Xeon E5-2696 v4**
Generation: Intel Penryn Intel Sandy Bridge Intel Haswell Intel Broadwell
Cores/Threads: 8/8 16/32 28/56 44/88
Processor Speed: 3.2 GHz 3.5 GHz 2.03 GHz 2.2 GHz
Memory: 64 GB - 800 MHz 256 GB - ??? 128 GB 768 GB
Version: v0.6.9 - SSE4.1 v0.6.2/3 - AVX v0.6.9 - AVX2 v0.6.9 - AVX2
25,000,000 6.578 2.283 0.907 0.949
50,000,000 12.965 4.295 1.745 1.738
100,000,000 26.194 8.167 3.317 3.240
250,000,000 71.395 20.765 8.339 8.006
500,000,000 157.234 42.394 17.708 17.416
1,000,000,000 347.962 89.920 37.311 36.401
2,500,000,000 979.043 239.154 102.131 100.409
5,000,000,000 2170.838 520.977 218.917 209.890
10,000,000,000 4768.905 1131.809 471.802 491.786
25,000,000,000   3341.281 1511.852 1324.022
50,000,000,000   7355.076   3038.402
100,000,000,000       6818.778

*Credit to Shigeru Kondo.

**Credit to "yoyo".

 

Fastest Times:

The full chart of rankings for each size can be found here:

These fastest times may include unreleased betas.


Got a faster time? Let me know: a-yee@u.northwestern.edu

Note that I usually don't respond to these emails. I simply put them into the charts which I update periodically.


Algorithms and Developments:

 

FAQ:

Q:  Where can I download the digits for Pi and other constants that are featured in the various record computations?

A:  There is currently no reliable place to get the digits. Due to the large sizes of the data, it simply isn't feasible to move them around.

 

Personally, I have an archive of some of the digits including the first 12.1 trillion digits of Pi. But because I'm on a consumer internet connection with rate and possibly bandwidth limits, it simply isn't possible for me to upload them. When I was still in school, I was able to seed some torrents since university connections are very fast. But that isn't possible anymore.

 

To answer the question directly, your best bet at getting the digits is to:

  1. Compute them locally using y-cruncher if you have the resources to do so.
  2. Contact the person who ran the computation and see if they still have the digits. (In all seriousness, this probably won't work since everyone seems to just delete the digits after the computation since they take too much space.)

Under extreme circumstances (if you're a famous professor trying to do research), I may make special arrangements to run research code locally on my machine. But this has happened only once and I was dealing with some pretty amazing professors which I didn't want to let down.

 

 

Q:  Is there a version that can use the GPU?

A:  This is still a no-go for current generation GPUs. But things may get more interesting with Xeon Phi.

  1. As of 2015, most GPUs are optimized for single-precision performance. Their double-precision and 64-bit integer throughput is far from impressive. (with notable exceptions being the Nvidia Tesla and Titan Black cards)

    The problem is that every single large integer multiplication algorithm uses either double-precision, 64-bit integer multiply, or carry-propagation. All of these operations are inefficient on current GPUs. And no, single-precision cannot be used because it imposes size limits that make the algorithms useless.

  2. GPUs require massive vectorization. Large number arithmetic is difficult to vectorize due to carry-propagation. The speedups that are currently achieved with CPU vectorization (SSE, AVX) are only modest at best.

  3. Large computations of Pi and other constants are not limited by computing power. The bottleneck is in the data communication. (memory bandwidth, disk I/O, etc...) So throwing GPUs at the problem (even if they could be utilized) would not help much.

Fundamental issues aside, the biggest practical barrier would be the need to rewrite the entire program using GPU programming paradigms.

 

It is worth mentioning the Xeon Phi co-processor line. Programming for these do not require a change of programming paradigm. Furthermore, the ISA convergence to AVX512 between Skylake and Knights Landing makes Xeon Phi even more attractive.

 

In the end, GPUs are still limited by PCIe bandwidth. Furthermore they do nothing to solve the disk I/O bottleneck. So while they may be interesting for small benchmarks, they won't offer much in terms breaking world records.

 

 

Q:  Why can't you use distributed computing to set records?

A:  No for more or less the same reasons that GPUs aren't useful.

  1. Just as with GPUs, computational power is not the bottleneck. It is the data communication. For this to be feasible as of 2015, everyone would need to have an internet connection speed of more than 1 GB/s. Anything slower than that and it's faster to do it on a single computer.

  2. Computing a lot of digits requires a lot of memory which would need to be distributed among all the participants. But there is no tolerance for data loss and distributed computing means that participants can freely join or leave the network at any time. Therefore, a tremendous amount of redundancy will be needed to ensure that no data is lost when participants leave.

 

Q:  Is there a distributed version that performs better on NUMA and HPC clusters?

A:  Not specifically. y-cruncher is still a shared memory program, so it inherently will not scale well into large networks.

 

However, there is one way that y-cruncher could (theoretically) scale. Most of the algorithms involved are cache-oblivious divide-and-conquer. This means that they (should) be able to scale given a hierarchical cache structure. However, this requires the following:

These policies effectively emulate the hierarchical cache that is necessary for cache-oblivious algorithms to run efficiently. Failing either of these and the computation will thrash the interconnect. That said, I'm not sure if any operating systems provide this sort of NUMA policy.

 

Thanks to Hyperthreading, y-cruncher is not overly sensitive to memory latency. It's the bandwidth that matters. So it may be beneficial to interleave memory across all the nodes to utilize all the bandwidth even if it hurts locality a bit.

On Linux this can be done using: numactl --interleave=all "./y-cruncher.out"

 

 

Q:  Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?

A:  Yes. But it isn't as stable as a library should be. Support only exists for 64-bit Windows and backwards compatibility breaks on a regular basis.

 

 

Q:  What's the difference between "Total Computation Time" and "Total Time"? Which is relevant for benchmarks?

A:  "Total Computation Time" is the total time required to compute the constant. It does not include the time needed to write the digits to disk nor does it include the time needed to verify the base conversion. "Total Time" is the end-to-end time of the entire computation which includes everything.

 

The CPU utilization measurements cover the same thing as the "Total Computation Time". It does not include the output or the base convert verify.

 

For benchmarking, it's better to use the "Total Computation Time". A slow disk that takes a long time to write out the digits will affect neither the computation time nor the CPU utilization measurements. Most other Pi-programs measure time the same way, so y-cruncher does the same for better comparability. All the benchmark charts on this website as well as any forum threads run by myself will use the "Total Computation Time".

 

For world record size computations, we use the "Total Time" since everything is relevant - including down time. If the computation was done in several phases, the run-time that is put in the charts is the difference between the start and end dates.

 

There's currently no "standard" for extremely long-running computations that are neither benchmarks nor world record sized.

 

 

Q:  What's the deal with the privilege elevation? Why does y-cruncher need administrator privileges in Windows?

A:  Privilege elevation is needed to work-around a security feature that would otherwise hurt performance.

 

In Swap Mode, y-cruncher creates large files and writes to them non-sequentially. When you create a new file and write to offset X, the OS will zero the file from the start to X. This zeroing is done for security reasons to prevent the program from reading data that has been leftover from files that have been deleted.

 

The problem is that this zeroing incurs a huge performance hit - especially when these swap files could be terabytes large. The only way to avoid this zeroing is to use the SetFileValidData() function which requires privilege elevation.

 

Linux doesn't have this problem since it implicitly uses sparse files.

 

As of v0.6.10, elevation is no longer required to run y-cruncher and perform basic computations. Instead, y-cruncher will request that it be re-run with elevation in order to access certain features like swap mode.

 

 

Q:  Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!

A:  For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done.

 

For those who prefer academic terminology, y-cruncher has weak scalability, but not strong scalability. For a fixed computation size, is it not possible to sustain a fixed efficiency while increasing the number of processors. But it is possible if you increase the computation size as well.

 

 

Q:  Is y-cruncher open-sourced?

A:  y-cruncher itself is closed source. But some of the related side-projects like the Digit Viewer and the Number Factory app are open-sourced.

 

Links:

Here's some interesting sites dedicated to the computation of Pi and other constants:

 

Questions or Comments

Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.