y-cruncher - A Multi-Threaded Pi-Program
From a high-school project that went a little too far...
By Alexander J. Yee
(Last updated: August 11, 2015)
The first scalable multi-threaded Pi-benchmark for multi-core systems...
How fast can your computer compute Pi?
y-cruncher is a program that can compute Pi and other constants to trillions of digits.
It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.
y-cruncher has been used to set several world records for the most digits of Pi ever computed.
Windows: Version 0.6.8 Build 9461 (Released: May 7, 2015)
Linux : Version 0.6.8 Build 9461 (Released: May 7, 2015)
Official Xtremesystems Forums thread.
200 billion digits of Catalan's Constant: (June 8, 2015)
I'm pleased to announce that after running for more than half a year, Robert Setti has computed Catalan's Constant to 200 billion digits.
This is very impressive because Catalan's Constant is one of the slowest to compute. (among popular constants that can be computed in quasi-linear time)
Version v0.6.8 (fix 2): (May 7, 2015)
A lot of unexpected personal stuff happened this last month. I'll be starting a new job next week that is potentially much more stressful than ever before.
So I've decided to push out all the remaining bugfixes for v0.6.8. Depending on how things turn out, this may very well be the last version until the Skylake Xeon.
y-cruncher has been used to set a number world record size computations.
Blue: Current World Record
Green: Former World Record
Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.
|Date Announced||Date Completed:||Source:||Who:||Constant:||Decimal Digits:||Time:||Computer:|
|August 11, 2015||August 7, 2015||Ron Watkins||Zeta(3) - Apery's Constant||250,000,000,000||4 x Xeon X6550 @ 2 GHz - 512 GB|
|July 24, 2015||July 22, 2015
July 23, 2015
|Golden Ratio||2,000,000,000,000||4 x Xeon X6550 @ 2 GHz - 512 GB
Xeon E5-2676 v3 @ 2.4 GHz - 64 GB
|July 13, 2015||July 12, 2015||Ron Watkins||Log(2)||250,000,000,000||4 x Xeon X6550 @ 2 GHz - 512 GB|
|July 8, 2015||July 5, 2015||Ron Watkins||Lemniscate||100,000,000,000||4 x Xeon X6550 @ 2 GHz - 512 GB|
|June, 26, 2015||June, 24, 2015||Matthew Hebert||e||1,400,000,000,000||FX-8370 @ 4.0 GHz - 8 GB
FX-8370 @ 4.0 GHz - 16 GB
|June 8, 2015||June 7, 2015||Robert Setti||Catalan's Constant||200,000,001,100||
|Intel Core i7 4820K @ 4.5 GHz
24 GB DDR3
|October 8, 2014||October 7, 2014||"houkouonchi"||Pi||13,300,000,000,000||
|2 x Xeon E5-4650L @ 2.6 GHz
192 GB DDR3 @ 1333 MHz
24 x 4 TB + 30 x 3 TB
|March 24, 2014||March 10, 2014||Shigeru Kondo||Log(10)||200,000,000,050||
|2 x Xeon E5-2690 @ 3.3 GHz
256 GB DDR3 @ 1600 MHz
12 x 3 TB
|December 28, 2013||December 28, 2013||Source||Shigeru Kondo||Pi||12,100,000,000,050||2 x Xeon E5-2690 @ 2.9 GHz
128 GB DDR3 @ 1600 MHz
24 x 3 TB
|December 22, 2013||December 22, 2013||Alexander Yee||Euler-Mascheroni Constant||119,377,958,182||
2 x Intel Xeon X5482 @ 3.2 GHz
64 GB DDR2 FB-DIMM
64 GB SSD (Boot) + 2 TB (Data)
8 x 2 TB (Computation)
|February 9, 2012||February 9, 2012||Alexander Yee||Square Root of 2||2,000,000,000,050||2 x Xeon X5482 @ 3.2 GHz - 64 GB
8 x 2 TB
Core i7 2600K @ 4.4 GHz - 16 GB
5 x 1 TB + 5 x 2 TB
See the complete list including other notably large computations.
Aside from computing Pi and other constants, y-cruncher is great for stress testing 64-bit systems with lots of ram.
Latest Release: (May 7, 2015)
- Windows Vista or later.
- You may need to install: Microsoft Visual C++ 2013 Redistributable Package
- Privilege elevation is needed to run y-cruncher. So this may result in UAC prompts.
See the FAQ for why y-cruncher needs privilege elevation.
- 64-bit Linux is required. There is no support for 32-bit.
- You may need to enable execute permissions. This can be done by running the following command on the y-cruncher directory: "chmod -R 777 *.out"
- An x86 or x64 processor with SSE3 instructions. This shouldn't be a problem since nearly all PCs since 2006 has them.
Main Page: y-cruncher - Version History
Other Downloads (for C++ programmers):
Comparison Chart: (Last updated: February 24, 2015)
Computations of Pi to various sizes. All times in seconds. All times include the time needed to convert the digits to decimal representation.
|Processor(s):||Core 2 Quad Q6600||Core i7 920||Core i7 3630QM||FX-8350||Core i7 4770K||Core i7 5960X|
|Generation:||Intel Core||Intel Nehalem||Intel Ivy Bridge||AMD Piledriver||Intel Haswell||Intel Haswell|
|Processor Speed:||2.4 GHz||3.5 GHz (OC)||2.4 GHz (3.2 GHz turbo)||4.0 GHz (4.2 GHz turbo)||4.0 GHz (OC)||4.0 GHz (OC)|
|Memory:||6 GB - 800 MHz||12 GB - 1333 MHz||8 GB - 1600 MHz||16 GB - 1333 MHz||32 GB - 1866 MHz||64 GB - 2666 MHz|
|Version:||v0.6.3 - SSE3||v0.6.3 - SSE4.1||v0.6.7 - AVX||v0.6.7 - XOP||v0.6.7 - AVX2||v0.6.7 - AVX2|
|Processor(s):||2 x Xeon X5482||2 x Xeon E5-2690*|
|Generation:||Intel Penryn||Intel Sandy Bridge|
|Processor Speed:||3.2 GHz||3.5 GHz|
|Memory:||64 GB - 800 MHz||256 GB - ???|
|Version:||v0.6.3 - SSE4.1||v0.6.2/3 - AVX|
*Credit to Shigeru Kondo.
The full chart of rankings for each size can be found here:
These fastest times may include unreleased betas.
Got a faster time? Let me know: email@example.com
Note that I usually don't respond to these emails. I simply put them into the charts which I update periodically.
Q: Is there a version that can use the GPU?
A: This is still a no-go for current generation GPUs. But things may get more interesting with Xeon Phi.
- As of 2015, most GPUs are optimized for single-precision performance. Their double-precision and 64-bit integer throughput is far from impressive. (with notable exceptions being the Nvidia Tesla and Titan Black cards)
The problem is that every single large integer multiplication algorithm uses either double-precision, 64-bit integer multiply, or carry-propagation. All of these operations are inefficient on current GPUs. And no, single-precision cannot be used because it imposes size limits that make the algorithms useless.
- GPUs require massive vectorization. Large number arithmetic is difficult to vectorize due to carry-propagation. The speedups that are currently achieved with CPU vectorization (SSE, AVX) are only modest at best.
- Large computations of Pi and other constants are not limited by computing power. The bottleneck is in the data communication. (memory bandwidth, disk I/O, etc...) So throwing GPUs at the problem (even if they could be utilized) would not help much.
Fundamental issues aside, the biggest practical barrier would be the need to rewrite the entire program using GPU programming paradigms.
It is worth mentioning the Xeon Phi co-processor line. Programming for these do not require a change of programming paradigm. Furthermore, the ISA convergence to AVX512 between Skylake and Knights Landing makes Xeon Phi even more attractive.
In the end, GPUs are still limited by PCIe bandwidth. Furthermore they do nothing to solve the disk I/O bottleneck. So while they may be interesting for small benchmarks, they won't offer much in terms breaking world records.
Q: Why can't you use distributed computing to set records?
A: No for more or less the same reasons that GPUs aren't useful.
- Just as with GPUs, computational power is not the bottleneck. It is the data communication. For this to be feasible as of 2015, everyone would need to have an internet connection speed of more than 1 GB/s. Anything slower than that and it's faster to do it on a single computer.
- Computing a lot of digits requires a lot of memory which would need to be distributed among all the participants. But there is no tolerance for data loss and distributed computing means that participants can freely join or leave the network at any time. Therefore, a tremendous amount of redundancy will be needed to ensure that no data is lost when participants leave.
Q: Is there a distributed version that performs better on NUMA and HPC clusters?
A: Not specifically. y-cruncher is still a shared memory program, so it inherently will not scale well into large networks.
However, there is one way that y-cruncher could (theoretically) scale. Most of the algorithms involved are cache-oblivious divide-and-conquer. This means that they (should) be able to scale given a hierarchical cache structure. However, this requires the following:
- Page Migration: When node A starts using memory that is mapped to node B, the corresponding pages need to be migrated to node A.
- Page Replication: If nodes A and B are reading the same memory, the corresponding pages need to be duplicated on both nodes.
These policies effectively emulate the hierarchical cache that isnecessary for cache-oblivious algorithms to run efficiently. Failing either of these and the computation will thrash the interconnect. That said, I'm not sure if any operating systems provide this sort of NUMA policy.
Thanks to Hyper-threading, y-cruncher is not overly sensitive to memory latency. It's the bandwidth that matters. So it may be beneficial to interleave memory across all the nodes to utilize all the bandwidth even if it hurts locality a bit.
On Linux this can be done using: numactl --interleave=all "./y-cruncher.out"
Q: Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?
A: This was something I tried back in 2012, but it didn't work out. The problem is that the interface changes far too quickly for it to be maintainable.
That said, there is some on-going work to revive the project. But there's no telling when it will be ready.
Q: What's the deal with the privilege elevation? Why does y-cruncher need administrator privileges in Windows?
A: Privilege elevation is needed to work-around a security feature that would otherwise hurt performance.
In Swap Mode, y-cruncher creates large files and writes to them non-sequentially. When you create a new file and write to offset X, the OS will zero the file from the start to X. This zeroing is done for security reasons to prevent the program from reading data that has been leftover from files that have been deleted.
The problem is that this zeroing incurs a huge performance hit - especially when these swap files could be terabytes large. The only way to avoid this zeroing is to use the SetFileValidData() function which requires privilege elevation.
Linux doesn't have this problem since it implicitly uses sparse files.
Q: Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!
A: For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done.
As an additional note, y-cruncher has never been particularly efficient with the way it uses threads. Historically it always spawns new threads for tasks and then kills them when the task is done. While this works fine for large computations, it has a lot of overhead for small ones. This is partially being addressed in v0.6.8 with a revamp of the threading framework.
Q: Is y-cruncher open-sourced?
A: In short no. But roughly 5% of the code (mostly involving the Digit Viewer) is open sourced.
Here's some interesting sites dedicated to the computation of Pi and other constants:
Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.