y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

(Last updated: January 30, 2015)




The first scalable multi-threaded Pi-benchmark for multi-core systems...


How fast can your computer compute Pi?


y-cruncher is a program that can compute Pi and other constants to trillions of digits.

It is the first of its kind that is multi-threaded and scalable to multi-core systems. Since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.


y-cruncher has been used to set several world records for the most digits of Pi ever computed.


Current Release:

Windows: Version 0.6.6 Build 9452 (Released: December 21, 2014)

Linux: Version 0.6.6 Build 9452 (Released: December 21, 2014)


Official Xtremesystems Forums thread.




Version 0.6.7 Preview: (January 27, 2015)


Version 0.6.7 has been built and is undergoing final testing. But I have no idea how long that will take. While everything looks good on Windows, testing on Linux is currently blocked while I diagnose an instability on my storage workstation with Linux.


The likely source of the instability is a massive hardware upgrade in December. Since Windows is fine and Linux is unstable, I suspect it's a driver issue. But I have yet to sort it out. The fact that I'm not much of a Linux person isn't really helping the situation.


In any case, part of that hardware upgrade involved adding 8 hard drives to the machine for a total of 16 drives. So v0.6.7 consists of mostly swap mode improvements that I decided to do after playing around with this 16 hard drive toy.



The main feature of v0.6.7 is a swap-mode multiplication tester which has two purposes:

The second point is important for anyone attempting world record computations. As the sole developer of y-cruncher, I only have the resources to test large multiplications up to around 5 trillion digits, which means that I cannot reach the sizes that are now required to set Pi world records.


In the past, there have been bugs in the multiplication which only manifest at sizes that nobody has ever reached before. The scenario that I want to avoid is for someone else to spend months attempting a world record only to fail because of a bug in my code. In a sense, it's somewhat miraculous that y-cruncher is 4 for 4 in world record attempts for Pi. (i.e. no fatal software bugs)



Older News


Records Set by y-cruncher:

y-cruncher has been used to set a number of world-record-size computations.


Blue: Current World Record

Green: Former World Record

Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.

Date Announced / Date Completed | Source | Who | Constant | Decimal Digits | Time | Computer
October 8, 2014 / October 7, 2014 | | "houkouonchi" | Pi | 13,300,000,000,000 | Compute: 208 days; Verify: 182 hours; Validation File | 2 x Xeon E5-4650L @ 2.6 GHz; 192 GB DDR3 @ 1333 MHz; 24 x 4 TB + 30 x 3 TB
March 24, 2014 / March 10, 2014 | | Shigeru Kondo | Log(10) | 200,000,000,050 | Compute: 44.4 hours; Verify: 49.7 hours; Validations: 1, 2 | 2 x Xeon E5-2690 @ 3.3 GHz; 256 GB DDR3 @ 1600 MHz; 12 x 3 TB
February 28, 2014 | | Shigeru Kondo | Log(2) | 200,000,000,050 | Compute: 55.8 hours; Verify: 56.5 hours; Validations: 1, 2 | 2 x Xeon E5-2690 @ 3.3 GHz; 256 GB DDR3 @ 1600 MHz; 12 x 3 TB
December 28, 2013 / December 28, 2013 | Source | Shigeru Kondo | Pi | 12,100,000,000,050 | Compute: 94 days; Verify: 46 hours | 2 x Xeon E5-2690 @ 2.9 GHz; 128 GB DDR3 @ 1600 MHz; 24 x 3 TB
December 22, 2013 / December 22, 2013 | | Alexander Yee | Euler-Mascheroni Constant | 119,377,958,182 | Compute: 50 days; Verify: 38 days; Validations: 1, 2 | 2 x Intel Xeon X5482 @ 3.2 GHz; 64 GB SSD (Boot) + 2 TB (Data); 8 x 2 TB (Computation)
September 13, 2013 / September 13, 2013 | Source | Setti Financial LLC | Zeta(3) - Apery's Constant | 200,000,001,000 | Compute: ~5 months; Not Verified | Intel Core i5-3570S @ 3.1 GHz; 16 GB
April 8, 2013 / April 8, 2013 | Source | Setti Financial LLC | Catalan's Constant | 100,000,000,000 | Compute: ~4 months; Not Verified | 2 x Intel Xeon X5460 @ 3.16 GHz; 16 GB DDR2
February 9, 2012 / February 9, 2012 | | Alexander Yee | Square Root of 2 | 2,000,000,000,050 | Compute: 110 hours; Verify: 119 hours | 2 x Xeon X5482 @ 3.2 GHz - 64 GB, 8 x 2 TB; Core i7 2600K @ 4.4 GHz - 16 GB, 5 x 1 TB + 5 x 2 TB
September 17, 2010 / September 17, 2010 | Source | Alexander Yee | Zeta(3) - Apery's Constant | 100,000,001,000 | Compute: 148 hours; Verify: 366 hours | "Nagisa" + "Ushio"
July 8, 2010 / July 8, 2010 | Source | Alexander Yee | Golden Ratio | 1,000,000,000,000 | Compute: 114 hours; Verify: ~7 days* | 2 x Intel Xeon X5482 @ 3.2 GHz; 1.5 TB (Boot + Output); 4 x 1 TB (2 x 2 RAID0) + 6 x 2 TB
July 5, 2010 / July 5, 2010 | Source | Shigeru Kondo | e | 1,000,000,000,000 | Compute: 224 hours; Verify: 219 hours | Intel Core i7 980X @ 3.33 GHz; 12 GB DDR3; 2 TB (Boot + Output); 8 x 1 TB (Computation)
April 16, 2009 / April 16, 2009 | Source | Alexander Yee & Raymond Chan | Catalan's Constant | 31,026,000,000 | Compute: 178 hours; Verify: 221 hours |

*Not a continuous run.

See the complete list.



Aside from computing Pi and other constants, y-cruncher is great for stress testing 64-bit systems with lots of RAM.




Sample Screenshot: 100 billion digits of Pi


Latest Release: (December 21, 2014)

Windows: y-cruncher v0.6.6.9452.zip (7.32 MB)
Linux: y-cruncher v0.6.6.9452.tar.gz (8.85 MB)


System Requirements:



All Systems:


Version History:

Main Page: y-cruncher - Version History


Other Downloads (for C++ programmers):


Advanced Documentation:






Known Issues:


Functionality Issues:


Performance Issues:



Comparison Chart: (Last updated: January 30, 2015)


Computations of Pi to various sizes. All times in seconds. All times include the time needed to convert the digits to decimal representation.


Single-Processor Desktops:

Processor(s): | Core 2 Quad Q6600 | Core i7 920 | Core i7 3630QM | FX-8350 | Core i7 4770K | Core i7 5960X
Generation: | Intel Merom | Intel Nehalem | Intel Ivy Bridge | AMD Piledriver | Intel Haswell | Intel Haswell
Cores/Threads: | 4/4 | 4/8 | 4/8 | 8/8 | 4/8 | 8/16
Processor Speed: | 2.4 GHz | 3.5 GHz (OC) | 2.4 GHz (3.2 GHz turbo) | 4.0 GHz (4.2 GHz turbo) | 4.0 GHz (OC) | 4.0 GHz (OC)
Memory: | 6 GB - 800 MHz | 12 GB - 1333 MHz | 8 GB - 1600 MHz | 16 GB - 1333 MHz | 32 GB - 1866 MHz | 64 GB - 2666 MHz
Version: | v0.6.3 - SSE3 | v0.6.3 - SSE4.1 | v0.6.3 - AVX | v0.6.7 - XOP | v0.6.7 - AVX2 | v0.6.7 - AVX2
25,000,000 | 12.925 | 6.852 | 5.435 | 6.188 | 2.180 | 1.502
50,000,000 | 27.713 | 14.272 | 11.596 | 11.629 | 4.733 | 2.929
100,000,000 | 59.752 | 30.910 | 25.594 | 23.839 | 10.206 | 5.822
250,000,000 | 171.932 | 86.899 | 73.017 | 63.987 | 28.675 | 15.593
500,000,000 | 388.090 | 191.235 | 174.005 | 142.572 | 63.602 | 34.570
1,000,000,000 | 862.522 | 429.040 | 404.577 | 307.381 | 139.011 | 74.362
2,500,000,000 | | | | 869.671 | 398.734 | 212.347
5,000,000,000 | | | | | 863.474 | 450.680
10,000,000,000 | | | | | | 976.559


Multi-Processor Workstation/Servers:

Processor(s): | 2 x Xeon X5482 | 2 x Xeon E5-2690*
Generation: | Intel Penryn | Intel Sandy Bridge
Cores/Threads: | 8/8 | 16/32
Processor Speed: | 3.2 GHz | 3.5 GHz
Memory: | 64 GB - 800 MHz | 256 GB - ???
Version: | v0.6.3 - SSE4.1 | v0.6.2/3 - AVX
25,000,000 | 6.923 | 2.283
50,000,000 | 14.386 | 4.295
100,000,000 | 28.242 | 8.167
250,000,000 | 76.197 | 20.765
500,000,000 | 157.537 | 42.394
1,000,000,000 | 346.963 | 89.920
2,500,000,000 | 964.038 | 239.154
5,000,000,000 | 2123.981 | 520.977
10,000,000,000 | 4633.681 | 1131.809
25,000,000,000 | | 3341.281
50,000,000,000 | | 7355.076

*Credit to Shigeru Kondo.



Fastest Times:

The full chart of rankings for each size can be found here:

These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@u.northwestern.edu



If you're interested in what formulas and algorithms y-cruncher uses:


Main Page: y-cruncher - Language and Algorithms





Q:  Is there a version that can use the GPU?
A:  No, for the following reasons, though that could change in the future.

  1. GPUs require massive vectorization. Large number arithmetic is difficult to vectorize due to carry-propagation.

  2. Large computations of Pi and other constants are not limited by computing power. The bottleneck is in the data communication. (memory bandwidth, disk I/O, etc...) So throwing GPUs at the problem (even if they could be utilized) would not help much.
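
To illustrate the first point, here is a minimal Python sketch of schoolbook multi-limb addition. It is not y-cruncher's code: the 32-bit limb size and the function name `add_limbs` are assumptions made purely for illustration. Each limb's result depends on the carry out of the previous limb, so the loop iterations cannot simply run side by side the way independent vector lanes can.

```python
LIMB_BITS = 32                      # assumed limb width, chosen for illustration
LIMB_MASK = (1 << LIMB_BITS) - 1


def add_limbs(a, b):
    """Add two little-endian limb lists of equal length.

    Returns (sum_limbs, carry_out). Hypothetical helper, not a y-cruncher API.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of limbs")
    result = []
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry           # depends on the previous iteration's carry
        result.append(total & LIMB_MASK)
        carry = total >> LIMB_BITS      # 0 or 1, feeds the next limb
    return result, carry


# Worst case: a carry generated in the lowest limb ripples through every limb.
limbs, carry = add_limbs([LIMB_MASK, LIMB_MASK, LIMB_MASK], [1, 0, 0])
print(limbs, carry)  # [0, 0, 0] 1
```

Techniques exist to shorten this dependency chain, but the sketch shows why the straightforward approach does not map cleanly onto wide vector hardware.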

Q:  What's the deal with the privilege elevation? Why does y-cruncher need administrator privileges in Windows?
A:  Privilege elevation is needed to work around a security feature that would otherwise hurt performance.

In Swap Mode, y-cruncher creates large files and writes to them non-sequentially. When you create a new file and write to offset X, the OS will zero the file from the start to X. This zeroing is done for security reasons to prevent the program from reading data that has been left over from files that have been deleted.

The problem is that this zeroing incurs a huge performance hit - especially when these swap files could be terabytes large. The only way to avoid this zeroing is to use the SetFileValidData() function which requires privilege elevation.

In Linux, the issue is avoided since it implicitly uses sparse files. However, this leads to file fragmentation - which is arguably worse.
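
The sparse-file behavior mentioned for Linux can be observed with a small Python sketch. This is only an illustration and says nothing about what y-cruncher does internally: the helper name `write_at_offset`, the file name, and the sizes are hypothetical, and whether the skipped range actually stays unallocated depends on the filesystem.

```python
import os
import tempfile


def write_at_offset(path, offset, payload):
    """Create a new file and write payload at offset, never touching the start.

    Returns (logical_size, allocated_bytes). allocated_bytes is None on
    platforms where os.stat() does not report st_blocks (e.g. Windows).
    """
    with open(path, "wb") as f:
        f.seek(offset)          # jump past a region that is never written
        f.write(payload)
    st = os.stat(path)
    blocks = getattr(st, "st_blocks", None)
    # POSIX reports st_blocks in 512-byte units.
    allocated = blocks * 512 if blocks is not None else None
    return st.st_size, allocated


with tempfile.TemporaryDirectory() as tmp:
    size, allocated = write_at_offset(
        os.path.join(tmp, "swap.bin"), 64 * 1024 * 1024, b"digits"
    )
    print("logical size:", size)
    print("allocated   :", allocated)
```

On a filesystem with sparse-file support, the allocated figure is typically far smaller than the logical size, because the skipped 64 MiB was never physically written.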

Q:  Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!
A:  The reason is simple. For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done.

Now here's a layman's explanation:

Suppose you had a 1000 page novel that needs to be proof-read. Now suppose you had 10 staff members. To speed up the process, you can assign 100 pages to each staff member. This basically makes the task 10x faster. This is an example of a "large computation" done using a "small" number of cores.

Now suppose you need to proof-read a 1 page paper. And you have a staff of 100 people. Are you going to tear the page into 100 small parts and give one to each person to proof-read? Furthermore, organizing your group of 100 staff members is probably going to take longer than proof-reading the entire page yourself! This is an example of a "small computation" using a "large" number of cores.

Of course, this is a highly simplified explanation of what is really going on. But the overall idea is the same.
It should be noted that some tasks are easier to parallelize than others. Most of the multi-threaded benchmarks that are used in the overclocking community tend to be highly "synthetic" tasks that are extremely easy to parallelize. These "synthetic" tasks typically achieve perfect parallelism regardless of the size of the task. But most other applications that do any sort of "meaningful" task tend to be less ideal. (And no, I'm not trying to imply that computing Pi is in any way meaningful.)
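
The proof-reading analogy can be put into a toy cost model. The sketch below is an illustration under stated assumptions, not how y-cruncher actually decides how many threads to use: the function names, the fixed per-thread startup cost, and the perfectly divisible work are all hypothetical.

```python
def estimated_time(work, threads, spawn_cost):
    """Toy model: each thread costs spawn_cost to start, and the work splits evenly."""
    return spawn_cost * threads + work / threads


def best_thread_count(work, max_threads, spawn_cost):
    """Return the thread count in 1..max_threads with the lowest modeled time."""
    return min(
        range(1, max_threads + 1),
        key=lambda t: estimated_time(work, t, spawn_cost),
    )


# Hypothetical units: work and spawn_cost are both in milliseconds.
print(best_thread_count(work=1, max_threads=8, spawn_cost=5))        # 1
print(best_thread_count(work=100_000, max_threads=8, spawn_cost=5))  # 8
```

In this model the tiny job is fastest on a single thread, while the large job benefits from every available thread, which mirrors the 1 page paper versus the 1000 page novel.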

Q:  Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?
A:  Not right now. It was a work-in-progress at one point that came very close to release, but stuff happens.

One of the biggest issues right now is that the API changes extremely rapidly - often on impulse due to unforeseen situations arising from the normal development of y-cruncher. New functions are constantly being added, old ones removed, existing ones modified. A year ago, this library thing seemed like a great idea. I even wrote a number of mini programs that used it. But now a year later, the API has changed so much that none of these mini programs will compile anymore.

Obviously, the API was written for and remains heavily influenced by y-cruncher itself, so forking a separate version for public release seems like a maintenance nightmare. Until I can find a better way to approach this, I can't see a library going public any time soon.

Q:  Is there a distributed version that performs better on NUMA and HPC clusters?
A:  Version 0.6.1 should have slightly better NUMA performance for extremely large computations (> 50 billion digits). But no, there is no version that is specially designed for large-scale NUMA or cluster systems.

For now, a speedup can be gained in Linux by running y-cruncher with interleaved memory: numactl --interleave=all "./x64 SSE3.out"

Q:  Is y-cruncher open-sourced?
A:  No.

Q:  Who are you?
A:  About me.



Here are some interesting sites dedicated to the computation of Pi and other constants:


Questions or Comments

Contact me via e-mail. I'm pretty good about responding unless your message gets caught in my school's junk mail filter.