y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

(Last updated: October 14, 2014)




The first scalable multi-threaded Pi-benchmark for multi-core systems...


How fast can your computer compute Pi?


y-cruncher is a program that can compute Pi and other constants to trillions of digits.

It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.


y-cruncher has been used to set several world records for the most digits of Pi ever computed.


Current Release:

Windows: Version 0.6.5 Build 9444b (fix 2) (Released: August 23, 2014)

Linux      : Version 0.6.5 Build 9444b (fix 2) (Released: August 23, 2014)


Official Xtremesystems Forums thread.




World Record - 13.3 trillion digits of Pi: (October 8, 2014)


I'm pleased to announce that "houkouonchi" (who wishes to remain anonymous) has set a new world record for the most digits of Pi: 13,300,000,000,000 digits.


The computation took 208 days and was done using y-cruncher v0.6.3 on a workstation with the specs listed in the records table below.

I did the verification myself using the BBP formula; it took 182 hours on a Core i7 920 @ 3.5 GHz.

Overall, this computation was slower than Shigeru Kondo's 12.1 trillion because the machine had less disk bandwidth and was not dedicated to the task.


More details coming soon...


For now, the digits can be downloaded here*: http://fios.houkouonchi.jp:8080/pi/

You can contact houkouonchi at: houkouonchi@houkouonchi.jp


*In order to view and/or decompress the digits, you will need the Digit Viewer. It comes bundled with y-cruncher.



Version v0.6.5: (May 26, 2014)


It took way too long, but support for AVX2 has been added. The new binary targets Haswell processors and requires AVX2, FMA3, and BMI2 instructions.

Theoretically, it should also be able to run on AMD Excavator processors.


As a word of warning: on Haswell, the AVX2 binary runs considerably hotter than the AVX binary. So please take care when running it, with or without an overclock.

This is especially true given Haswell's well-known thermal issues.



Records Set by y-cruncher:

y-cruncher has been used to set a number of world-record-sized computations.


Blue: Current World Record

Green: Former World Record

Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.

Each entry lists the announcement date (and completion date, where different), who ran it, the constant, the number of decimal digits, the compute/verify times, validation links, and the hardware used.

October 8, 2014 (completed October 7, 2014) - "houkouonchi"
Pi: 13,300,000,000,000 digits
Compute: 208 days | Verify: 182 hours
Validation File
2 x Xeon E5-4650L @ 2.6 GHz | 192 GB DDR3 @ 1333 MHz | 24 x 4 TB + 30 x 3 TB

March 24, 2014 (completed March 10, 2014) - Shigeru Kondo
Log(10): 200,000,000,050 digits
Compute: 44.4 hours | Verify: 49.7 hours
Validations: 1, 2
2 x Xeon E5-2690 @ 3.3 GHz | 256 GB DDR3 @ 1600 MHz | 12 x 3 TB

February 28, 2014 - Shigeru Kondo
Log(2): 200,000,000,050 digits
Compute: 55.8 hours | Verify: 56.5 hours
Validations: 1, 2
2 x Xeon E5-2690 @ 3.3 GHz | 256 GB DDR3 @ 1600 MHz | 12 x 3 TB

December 28, 2013 - Shigeru Kondo (Source)
Pi: 12,100,000,000,050 digits
Compute: 94 days | Verify: 46 hours
2 x Xeon E5-2690 @ 2.9 GHz | 128 GB DDR3 @ 1600 MHz | 24 x 3 TB

December 22, 2013 - Alexander Yee
Euler-Mascheroni Constant: 119,377,958,182 digits
Compute: 50 days | Verify: 38 days
Validations: 1, 2
2 x Intel Xeon X5482 @ 3.2 GHz | 64 GB SSD (Boot) + 2 TB (Data) | 8 x 2 TB (Computation)

September 13, 2013 - Setti Financial LLC (Source)
Zeta(3) - Apery's Constant: 200,000,001,000 digits
Compute: ~5 months | Not Verified
Intel Core i5-3570S @ 3.1 GHz | 16 GB

April 8, 2013 - Setti Financial LLC (Source)
Catalan's Constant: 100,000,000,000 digits
Compute: ~4 months | Not Verified
2 x Intel Xeon X5460 @ 3.16 GHz | 16 GB DDR2

February 9, 2012 - Alexander Yee
Square Root of 2: 2,000,000,000,050 digits
Compute: 110 hours | Verify: 119 hours
2 x Xeon X5482 @ 3.2 GHz - 64 GB | 8 x 2 TB
Core i7 2600K @ 4.4 GHz - 16 GB | 5 x 1 TB + 5 x 2 TB

September 17, 2010 - Alexander Yee (Source)
Zeta(3) - Apery's Constant: 100,000,001,000 digits
Compute: 148 hours | Verify: 366 hours
"Nagisa" + "Ushio"

July 8, 2010 - Alexander Yee (Source)
Golden Ratio: 1,000,000,000,000 digits
Compute: 114 hours | Verify: ~7 days (not a continuous run)
2 x Intel Xeon X5482 @ 3.2 GHz | 1.5 TB (Boot + Output) | 4 x 1 TB (2 x 2 RAID0) + 6 x 2 TB

July 5, 2010 - Shigeru Kondo (Source)
e: 1,000,000,000,000 digits
Compute: 224 hours | Verify: 219 hours
Intel Core i7 980X @ 3.33 GHz | 12 GB DDR3 | 2 TB (Boot + Output) | 8 x 1 TB (Computation)

April 16, 2009 - Alexander Yee & Raymond Chan (Source)
Catalan's Constant: 31,026,000,000 digits
Compute: 178 hours | Verify: 221 hours


See the complete list.



Aside from computing Pi and other constants, y-cruncher is great for stress-testing 64-bit systems with lots of RAM.




Sample Screenshot: 100 billion digits of Pi


Latest Release: (August 23, 2014)

Windows: y-cruncher v0.6.5.9444b (fix 2).zip (7.76 MB)
Linux      : y-cruncher v0.6.5.9444b (fix 2).tar.gz (8.44 MB)


System Requirements:



All Systems:


Version History:

Main Page: y-cruncher - Version History


Other Downloads (for C++ programmers):


Advanced Documentation:






Known Issues:


Functionality Issues:


Performance Issues:



Comparison Chart: (Last updated: June 21, 2014)


Computations of Pi to various sizes. All times in seconds. All times include the time needed to convert the digits to decimal representation.

Processor(s): Core 2 Quad Q6600 Core i7 920 Core i7 3630QM FX-8350 Core i7 4770K 2 x Xeon X5482 2 x Xeon E5-2690*
Generation: Intel Merom Intel Nehalem Intel Ivy Bridge AMD Piledriver Intel Haswell Intel Penryn Intel Sandy Bridge
Cores/Threads: 4/4 4/8 4/8 8/8 4/8 8/8 16/32
Processor Speed: 2.4 GHz 3.5 GHz (OC) 2.4 GHz (3.2 GHz turbo) 4.0 GHz (4.2 GHz turbo) 4.0 GHz (OC) 3.2 GHz 3.5 GHz
Memory: 6 GB - 800 MHz 12 GB - 1333 MHz 8 GB - 1600 MHz 16 GB - 1333 MHz 32 GB - 1866 MHz 64 GB - 800 MHz 256 GB - ???
Version: v0.6.3 - SSE3 v0.6.3 - SSE4.1 v0.6.3 - AVX v0.6.4 - XOP v0.6.5 - AVX2 v0.6.3 - SSE4.1 v0.6.2/3 - AVX
25,000,000 12.925 6.852 5.435 7.207 3.237 6.923 2.283
50,000,000 27.713 14.272 11.596 13.908 6.672 14.386 4.295
100,000,000 59.752 30.910 25.594 27.797 14.560 28.242 8.167
250,000,000 171.932 86.899 73.017 71.436 41.889 76.197 20.765
500,000,000 388.090 191.235 174.005 153.344 92.372 157.537 42.394
1,000,000,000 862.522 429.040 404.577 338.529 205.992 346.963 89.920
2,500,000,000       1009.923 591.683 964.038 239.154
5,000,000,000         1311.937 2123.981 520.977
10,000,000,000           4633.681 1131.809
25,000,000,000             3341.281
50,000,000,000             7355.076

*Credit to Shigeru Kondo.



Fastest Times:

The full chart of rankings for each size can be found here:

*These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@u.northwestern.edu



If you're interested in what formulas and algorithms y-cruncher uses:


Main Page: y-cruncher - Language and Algorithms





Q:  Is there a version that can use the GPU?
A:  No, for the reasons below - though anything can change in the future.

  1. GPUs require massive vectorization. Large number arithmetic is difficult to vectorize due to carry-propagation.

  2. Large computations of Pi and other constants are not limited by computing power. The bottleneck is in the data communication. (memory bandwidth, disk I/O, etc...) So throwing GPUs at the problem (even if they could be utilized) would not help much.
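To make the carry-propagation point concrete, here is a hypothetical Python sketch (not y-cruncher code) of adding two large numbers stored as arrays of base-10^9 "limbs". Each limb's result depends on the carry out of the previous limb, so the loop forms a serial dependency chain that cannot be naively split across vector lanes:

```python
BASE = 10**9  # each limb holds 9 decimal digits

def bignum_add(a, b):
    """Add two equal-length little-endian limb arrays; return limbs of a + b."""
    result = []
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry          # depends on the carry from the previous limb
        result.append(total % BASE)
        carry = total // BASE          # this serial chain defeats vectorization
    if carry:
        result.append(carry)
    return result

# A worst case: adding 1 ripples a carry through every single limb.
print(bignum_add([BASE - 1, BASE - 1], [1, 0]))  # -> [0, 0, 1]
```

The worst case shown is exactly why the lanes can't be computed independently: the last limb's value isn't known until every earlier limb has finished.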

Q:  What's the deal with the privilege elevation? Why does y-cruncher need administrator privileges in Windows?
A:  Privilege elevation is needed to work around a security feature that would otherwise hurt performance.

In Swap Mode, y-cruncher creates large files and writes to them non-sequentially. When you create a new file and write to offset X, the OS zero-fills everything in the file before X. This zeroing is done for security: it prevents the program from reading leftover data from files that have since been deleted.

The problem is that this zeroing incurs a huge performance hit - especially when these swap files could be terabytes large. The only way to avoid this zeroing is to use the SetFileValidData() function which requires privilege elevation.

On Linux, the issue doesn't arise because files are implicitly sparse. However, sparse files lead to fragmentation - which is arguably worse.
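As a rough illustration of this write pattern (a hypothetical sketch, not y-cruncher code), the following Python snippet creates a file and writes at a large offset. On Linux the skipped region becomes a sparse hole; on Windows, without SetFileValidData(), the OS would physically zero-fill everything up to the offset before the write completes:

```python
import os
import tempfile

offset = 100 * 1024 * 1024  # stand-in for a multi-terabyte swap file offset

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.seek(offset)          # jump far past the current end of the file
    f.write(b"digits")      # first real write lands at `offset`
    path = f.name

st = os.stat(path)
print("logical size:", st.st_size)                 # offset + 6 bytes
print("blocks actually allocated:", st.st_blocks)  # small on Linux: the hole uses little disk
os.remove(path)
```

The gap between the logical size and the allocated blocks is the "hole" - it costs nothing on Linux, but closing that gap is exactly the zeroing work that Windows performs unless SetFileValidData() is used.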

Q:  Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!
A:  The reason is simple. For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done.

Now here's a layman explanation:

Suppose you had a 1000 page novel that needs to be proof-read. Now suppose you had 10 staff members. To speed up the process, you can assign 100 pages to each staff member. This basically makes the task 10x faster. This is an example of a "large computation" done using a "small" number of cores.

Now suppose you need to proof-read a 1-page paper, and you have a staff of 100 people. Are you going to tear the page into 100 small pieces and give one to each person? Organizing your group of 100 staff members would probably take longer than proofreading the entire page yourself! This is an example of a "small computation" using a "large" number of cores.

Of course, this is a highly simplified explanation of what is really going on. But the overall idea is the same.
It should be noted that some tasks are easier to parallelize than others. Most of the multi-threaded benchmarks used in the overclocking community are highly "synthetic" tasks that are extremely easy to parallelize, and they typically achieve perfect parallelism regardless of the size of the task. Most other applications that do any sort of "meaningful" work tend to be less ideal. (And no, I'm not trying to imply that computing Pi is in any way meaningful.)
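The size-based decision described above can be sketched in a few lines of Python. This is a hypothetical illustration - the threshold and heuristic are made up, not y-cruncher's actual logic:

```python
import os
from concurrent.futures import ThreadPoolExecutor

MIN_WORK_PER_THREAD = 1_000_000  # assumed threshold; tune per machine

def choose_workers(work_items, max_cores=None):
    """Pick a worker count from the job size, so tiny jobs skip thread overhead."""
    max_cores = max_cores or os.cpu_count()
    # Never spawn more threads than there are worthwhile chunks of work.
    return max(1, min(max_cores, work_items // MIN_WORK_PER_THREAD))

def parallel_sum(values):
    n = choose_workers(len(values))
    if n == 1:
        return sum(values)  # small job: no threads at all
    chunk = (len(values) + n - 1) // n
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return sum(pool.map(sum, parts))

print(choose_workers(100))            # 1 -> the one-page paper gets one proofreader
print(parallel_sum(list(range(10))))  # 45, computed serially
```

For a 100-item job the heuristic returns a single worker, matching the one-page-paper example: spawning threads would cost more than the work itself.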

Q:  Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?
A:  Not right now. It was a work-in-progress at one point that came very close to release, but stuff happens.

One of the biggest issues right now is that the API changes extremely rapidly - often on impulse due to unforeseen situations arising from the normal development of y-cruncher. New functions are constantly being added, old ones removed, and existing ones modified. A year ago, this library seemed like a great idea. I even wrote a number of mini programs that used it. But a year later, the API has changed so much that none of those mini programs compile anymore.

Since the API was written for y-cruncher and remains heavily influenced by it, forking a separate version for public release seems like a maintenance nightmare. So until I can find a better way to approach this, I can't see a library going public any time soon.

Q:  Is there a distributed version that performs better on NUMA and HPC clusters?
A:  Version v0.6.1 should have slightly better NUMA performance for extremely large computations (> 50 billion digits). But no, there is no version that is specially designed for large-scale NUMA or cluster systems.

For now, a speedup can be gained in Linux by running y-cruncher with interleaved memory: numactl --interleave=all "./x64 SSE3.out"

Q:  Is y-cruncher open-sourced?
A:  No.
Q:  Who are you?
A:  About me.



Here are some interesting sites dedicated to the computation of Pi and other constants:


Questions or Comments

Contact me via e-mail. I'm pretty good about responding unless it gets caught in my school's junk mail filter.