y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

12.1 trillion digits of Pi...
World Record for both Desktop and Supercomputer!

 

(Last updated: March 24, 2014)

 


The first scalable multi-threaded Pi-benchmark for multi-core systems...

 

How fast can your computer compute Pi?

 

y-cruncher is a program that can compute Pi and other constants to trillions of digits.

It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.

 

y-cruncher was used to set several world records for the most digits of Pi ever computed.

 

Current Release:

Windows: Version 0.6.4 Build 9424 (Released: March 14, 2014)

Linux      : Version 0.6.4 Build 9424 (Released: March 14, 2014)

 

Official Xtremesystems Forums thread.

 

News:

 

Pi Day and Version 0.6.4: (March 14, 2014)

 

Oh hey, look at the date! The long-promised (and overdue) version for AMD processors is finally done. The "x64 XOP ~ Miyu" binary is optimized for AMD processors and uses FMA4 and XOP instructions. It will not run on Intel processors.

 

 

Moving On:

 

AVX2 is next on the list. But progress has been severely hindered by numerous issues with the Visual Studio compiler. VS2012 has severe bugs in its AVX2 code generation. VS2013 has a 10 - 30% performance regression in AVX code generation. Both versions generate terrible FMA3 code.

 

Long story short, expect the next version of y-cruncher to see the return of the Intel Compiler...

 

 

Records Set by y-cruncher:

y-cruncher has been used to set a number of world-record-sized computations.

 

Blue: Current World Record

Green: Former World Record

Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.

Date Announced / Completed - Who - Constant: Decimal Digits - Time - Computer

March 24, 2014 (completed March 10, 2014) - Shigeru Kondo
Log(10): 200,000,000,050 digits
Compute: 44.4 hours / Verify: 49.7 hours - Validations: 1, 2
2 x Xeon E5-2690 @ 3.3 GHz / 256 GB DDR3 @ 1600 MHz / 12 x 3 TB

February 28, 2014 - Shigeru Kondo
Log(2): 200,000,000,050 digits
Compute: 55.8 hours / Verify: 56.5 hours - Validations: 1, 2
2 x Xeon E5-2690 @ 3.3 GHz / 256 GB DDR3 @ 1600 MHz / 12 x 3 TB

December 28, 2013 (Source) - Shigeru Kondo & Alexander Yee
Pi: 12,100,000,000,050 digits
Compute: 94 days / Verify: 46 hours
2 x Xeon E5-2690 @ 2.9 GHz / 128 GB DDR3 @ 1600 MHz / 24 x 3 TB

December 22, 2013 - Alexander Yee
Euler-Mascheroni Constant: 119,377,958,182 digits
Compute: 50 days / Verify: 38 days - Validations: 1, 2
"Nagisa": 2 x Intel Xeon X5482 @ 3.2 GHz / 64 GB DDR2 FB-DIMM / 64 GB SSD (Boot) + 2 TB (Data) / 8 x 2 TB (Computation)

September 13, 2013 (Source) - Setti Financial LLC
Zeta(3) - Apery's Constant: 200,000,001,000 digits
Compute: ~5 months / Not Verified
Intel Core i5-3570S @ 3.1 GHz / 16 GB

April 8, 2013 (Source) - Setti Financial LLC
Catalan's Constant: 100,000,000,000 digits
Compute: ~4 months / Not Verified
2 x Intel Xeon X5460 @ 3.16 GHz / 16 GB DDR2

February 9, 2012 - Alexander Yee
Square Root of 2: 2,000,000,000,050 digits
Compute: 110 hours / Verify: 119 hours
2 x Xeon X5482 @ 3.2 GHz - 64 GB / 8 x 2 TB
Core i7 2600K @ 4.4 GHz - 16 GB / 5 x 1 TB + 5 x 2 TB

September 17, 2010 (Source) - Alexander Yee
Zeta(3) - Apery's Constant: 100,000,001,000 digits
Compute: 148 hours / Verify: 366 hours
"Nagisa" + "Ushio"

July 8, 2010 (Source) - Alexander Yee
Golden Ratio: 1,000,000,000,000 digits
Compute: 114 hours / Verify: ~7 days (not a continuous run)
"Nagisa": 2 x Intel Xeon X5482 @ 3.2 GHz / 64 GB DDR2 FB-DIMM / 1.5 TB (Boot + Output) / 4 x 1 TB (2 x 2 RAID0) + 6 x 2 TB

July 5, 2010 (Source) - Shigeru Kondo
e: 1,000,000,000,000 digits
Compute: 224 hours / Verify: 219 hours
Intel Core i7 980X @ 3.33 GHz / 12 GB DDR3 / 2 TB (Boot + Output) / 8 x 1 TB (Computation)

April 16, 2009 (Source) - Alexander Yee & Raymond Chan
Catalan's Constant: 31,026,000,000 digits
Compute: 178 hours / Verify: 221 hours
"Nagisa"

See the complete list.

 

Features:

Aside from computing Pi and other constants, y-cruncher is great for stress-testing 64-bit systems with large amounts of RAM.

 

 

Download:

Sample Screenshot: 100 billion digits of Pi

 

Latest Release: (March 14, 2014)

Windows: y-cruncher v0.6.4.9424.zip (4.23 MB)
Linux      : y-cruncher v0.6.4.9424.tar.gz (5.55 MB)

 

System Requirements:

Click here for older versions.

 

Version History:

Main Page: y-cruncher - Version History

 

Other Downloads (for C++ programmers):

 

Advanced Documentation:

 


Known Issues:

 

Functionality Issues:

 

Performance Issues:

 

Benchmarks:

Comparison Chart: (Last updated: February 23, 2014)

 

Computations of Pi to various sizes. All times in seconds. All times include the time needed to convert the digits to decimal representation.

Processor(s): Core 2 Quad Q6600 | Core i7 920 | Core i7 3630QM | FX-8350 | Core i7 4770K | 2 x Xeon X5482 | 2 x Xeon E5-2690*
Generation: Intel Kentsfield | Intel Nehalem | Intel Ivy Bridge | AMD Piledriver | Intel Haswell | Intel Penryn | Intel Sandy Bridge
Cores/Threads: 4/4 | 4/8 | 4/8 | 8/8 | 4/8 | 8/8 | 16/32
Processor Speed: 2.4 GHz | 3.5 GHz (OC) | 2.4 GHz (3.2 GHz turbo) | 4.0 GHz (4.2 GHz turbo) | 4.0 GHz (OC) | 3.2 GHz | 3.5 GHz
Memory: 6 GB - 800 MHz | 12 GB - 1333 MHz | 8 GB - 1600 MHz | 16 GB - 1333 MHz | 32 GB - 1866 MHz | 64 GB - 800 MHz | 256 GB - ???
Version: v0.6.3 - SSE3 | v0.6.3 - SSE4.1 | v0.6.3 - AVX | v0.6.4 - XOP | v0.6.3 - AVX | v0.6.3 - SSE4.1 | v0.6.2/3 - AVX
25,000,000 | 12.925 | 6.852 | 5.435 | 7.207 | 3.819 | 6.923 | 2.283
50,000,000 | 27.713 | 14.272 | 11.596 | 13.908 | 7.954 | 14.386 | 4.295
100,000,000 | 59.752 | 30.910 | 25.594 | 27.797 | 16.733 | 28.242 | 8.167
250,000,000 | 171.932 | 86.899 | 73.017 | 71.436 | 48.497 | 76.197 | 20.765
500,000,000 | 388.090 | 191.235 | 174.005 | 153.344 | 106.544 | 157.537 | 42.394
1,000,000,000 | 862.522 | 429.040 | 404.577 | 338.529 | 240.086 | 346.963 | 89.920
2,500,000,000 | - | - | - | 1009.923 | 661.992 | 964.038 | 239.154
5,000,000,000 | - | - | - | - | 1468.196 | 2123.981 | 520.977
10,000,000,000 | - | - | - | - | - | 4633.681 | 1131.809
25,000,000,000 | - | - | - | - | - | - | 3341.281
50,000,000,000 | - | - | - | - | - | - | 7355.076

*Credit to Shigeru Kondo.

 

 

Fastest Times:

(Last updated: March 24, 2014)

 

The full chart of rankings for each size can be found here:

*These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@u.northwestern.edu

 


Algorithms:

If you're interested in what formulas and algorithms y-cruncher uses:

 

Main Page: y-cruncher - Language and Algorithms

 

 

 

FAQ:

Q:  Is there a version that can use the GPU?
A:  No, for the following reasons - though anything can change in the future.

  1. GPUs require massive vectorization. Large number arithmetic is difficult to vectorize due to carry-propagation.

  2. Large computations of Pi and other constants are not limited by computing power. The bottleneck is in the data communication. (memory bandwidth, disk I/O, etc...) So throwing GPUs at the problem (even if they could be utilized) would not help much.
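To see why carry-propagation resists vectorization, here is a minimal sketch (illustrative only, not y-cruncher's actual code) of big-number addition over base-10^9 "limbs". Each limb's result depends on the carry produced by the previous limb - a serial, loop-carried dependency that is hard to spread across thousands of GPU lanes.

```python
# Illustrative sketch: adding two big numbers stored as base-10^9 limbs,
# least-significant limb first. The carry forms a serial dependency chain.
BASE = 10**9

def bignum_add(a, b):
    """Add two limb arrays; note the carry feeding each NEXT iteration."""
    n = max(len(a), len(b))
    result = []
    carry = 0
    for i in range(n):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        result.append(s % BASE)
        carry = s // BASE  # this value is an input to the next iteration
    if carry:
        result.append(carry)
    return result

# A worst case: the carry ripples through every limb of the number.
print(bignum_add([999999999, 999999999], [1]))  # [0, 0, 1]
```

(There are carry-select and carry-save tricks that recover some parallelism, but they add work and complexity - which is part of why the GPU port has never been a priority.)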


Q:  What's the deal with the privilege elevation? Why does y-cruncher need administrator privileges in Windows?
A:  Privilege elevation is needed to work around a security feature that would otherwise hurt performance.

In Swap Mode, y-cruncher creates large files and writes to them non-sequentially. When you create a new file and write to offset X, the OS zeroes the file from the start up to X. This zeroing is done for security: it prevents the program from reading leftover data from files that have since been deleted.

The problem is that this zeroing incurs a huge performance hit - especially since these swap files can be terabytes in size. The only way to avoid the zeroing is to use the SetFileValidData() function, which requires privilege elevation.

On Linux, the issue is avoided because files are implicitly sparse. However, this leads to file fragmentation - which is arguably worse.
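The Linux sparse-file behavior can be demonstrated with a few lines of Python (a standalone illustration, not y-cruncher code; exact allocation behavior depends on the filesystem). Seeking far past the end of a new file and writing there leaves a "hole" that occupies no disk blocks - no zeroing pass is needed:

```python
# Demonstrate implicit sparse files on Linux: write at a large offset in a
# brand-new file and compare the apparent size to the allocated size.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "swapfile.bin")
offset = 100 * 1024 * 1024  # write 100 MB into the file

fd = os.open(path, os.O_CREAT | os.O_WRONLY)
os.lseek(fd, offset, os.SEEK_SET)  # seek past a 100 MB hole
os.write(fd, b"payload")           # write at offset X; bytes 0..X are never zeroed on disk
os.close(fd)

st = os.stat(path)
print("apparent size :", st.st_size)         # offset + 7 bytes
print("allocated size:", st.st_blocks * 512) # much smaller: the hole is unallocated
```

On Windows, the equivalent effect requires SetFileValidData(), which in turn requires the SE_MANAGE_VOLUME_NAME privilege - hence the elevation prompt.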


Q:  Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!
A:  The reason is simple. For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done.

Now here's a layman explanation:

Suppose you had a 1000 page novel that needs to be proof-read. Now suppose you had 10 staff members. To speed up the process, you can assign 100 pages to each staff member. This basically makes the task 10x faster. This is an example of a "large computation" done using a "small" number of cores.

Now suppose you need to proof-read a 1 page paper, and you have a staff of 100 people. Are you going to tear the page into 100 small parts and give one to each person to proof-read? Organizing your group of 100 staff members will probably take longer than proof-reading the entire page yourself! This is an example of a "small computation" using a "large" number of cores.

Of course, this is a highly simplified explanation of what is really going on. But the overall idea is the same.
It should be noted that some tasks are easier to parallelize than others. Most of the multi-threaded benchmarks that are used in the overclocking community tend to be highly "synthetic" tasks that are extremely easy to parallelize. These "synthetic" tasks typically achieve perfect parallelism regardless of the size of the task. But most other applications that do any sort of "meaningful" task tend to be less ideal. (And no, I'm not trying to imply that computing Pi is in anyway meaningful.)

Q:  Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?
A:  Not right now. It was a work-in-progress at one point that came very close to release, but stuff happens.

One of the biggest issues right now is that the API changes extremely rapidly - often on impulse, due to unforeseen situations that arise during the normal development of y-cruncher. New functions are constantly being added, old ones removed, and existing ones modified. A year ago, this library seemed like a great idea, and I even wrote a number of mini-programs that used it. But a year later, the API has changed so much that none of those mini-programs compile anymore.

Since the API was written for, and remains heavily influenced by, y-cruncher itself, forking a separate version for public release seems like a maintenance nightmare. So until I can find a better way to approach this, I can't see the library going public any time soon.

Q:  Is there a distributed version that performs better on NUMA and HPC clusters?
A:  Version v0.6.1 should have slightly better NUMA performance for extremely large computations (> 50 billion digits). But no, there is no version that is specially designed for large-scale NUMA or cluster systems.

For now, a speedup can be gained on Linux by running y-cruncher with interleaved memory: numactl --interleave=all "./x64 SSE3.out"

Q:  Is y-cruncher open-sourced?
A:  No.
 
Q:  Who are you?
A:  About me.

 

Links:

Here are some interesting sites dedicated to the computation of Pi and other constants:

 

Questions or Comments

Contact me via e-mail. I'm pretty good about responding, unless it gets caught in my school's junk-mail filter.