y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

(Last updated: January 28, 2010)

 


 

The first scalable multi-threaded Pi-benchmark for multi-core systems...

 

Against the Big Guns...

Faster than SuperPi on single-core...

Faster than PiFast 4.3 on dual-core...

Faster than QuickPi 4.5 on quad-core...

 

1 billion digits of Pi in 10 minutes on 3.33 GHz Core i7.

 

See the official XtremeSystems thread for more benchmarks.

 

Latest Version:

Version 0.4.4 Build 7762b (fix 2) (Released: January 6, 2010)

 

Changes to v0.4.4 (fix 2):

 

Starting from v0.4.1, y-cruncher allows Pi computations of up to 200 billion digits.

However, y-cruncher has only been tested up to 50 billion digits. (Credit to Shigeru Kondo, WaiKin Wong, and Rickie Chang)

 

Records Set by y-cruncher:

World Record Size Computations

Date Announced Date Completed: Source: Who: Constant: Decimal Digits: Time: Computer:
April 16, 2009 April 16, 2009 Source A. Yee & R. Chan Catalan's Constant 31,026,000,000 Compute:  178 hours (7.4 days)
Verify:  221 hours (9.2 days)
"Nagisa"
March 13, 2009 March 13, 2009 Source A. Yee & R. Chan Euler-Mascheroni Constant 29,844,489,545 Compute:  205 hours (8.5 days)
Verify:  269 hours (11.2 days)
"Nagisa"
February 28, 2009 Source A. Yee & R. Chan Log(10) 31,026,000,000 Compute and Verify:
40 hours (1.7 days)
"Nagisa"
February 15, 2009 Source A. Yee & R. Chan Zeta(3) - Apery's Constant 31,026,000,000 Compute:  45 hours (1.9 days)
Verify:  44 hours (1.8 days)
"Nagisa"
February 4, 2009 Source A. Yee & R. Chan Log(2) 31,026,000,000 Compute:  24 hours, 10 minutes
Verify:  15 hours, 58 minutes
"Nagisa"
January 31, 2009 January 31, 2009 Source A. Yee & R. Chan Catalan's Constant 15,510,000,000 Compute:  88 hours (3.7 days)
Verify:  100 hours (4.2 days)
"Nagisa"
January 21, 2009 January 21, 2009 Source A. Yee & R. Chan Zeta(3) - Apery's Constant 15,510,000,000 Compute:  20 hours, 18 minutes
Verify:  21 hours, 1 minute
"Nagisa"
January 18, 2009 January 18, 2009 Source A. Yee & R. Chan Euler-Mascheroni Constant 14,922,244,771 Compute:  96 hours (4 days)
Verify:  134 hours (5.5 days)
"Nagisa"
January 7, 2009 Source A. Yee & R. Chan Log(2) 15,500,000,000 Compute:  12 hours, 34 minutes
Verify:  8 hours, 20 minutes
"Nagisa"

 

World Record Speed Computations

See Fastest Times. Most of the times in that section are the world speed records.


The Storyline:

2005 - 2006:

The roots of y-cruncher date all the way back to my senior year in high school in my AP Computer Science class.

It started from a class project which was to write a multi-precision arithmetic library in Java that would support addition, subtraction, multiplication, and division.
After the assignment was due, I continued working on the library and named it "BigNumber". New features included square roots, trig functions, constants, etc...

 

June - October 2006:

After graduation, I began to take speed seriously. Multiplication was completely rewritten in C and linked back to BigNumber using JNI. Around this time I began to realize that parts of "BigNumber" were fairly fast - comparable to Mathematica. In particular, the function for computing the Euler-Mascheroni Constant was faster than that of Mathematica 5. By October, I realized that the world record of 108 million digits for the Euler-Mascheroni Constant was within reach.

 

November 2006:

With the goal of breaking the world record of 108 million digits for Euler's Constant in mind, November was spent entirely on implementing and optimizing the algorithms needed for extremely high precision arithmetic. I also upgraded my laptop from 512MB to 1.5GB of ram as that would be the computer that I would use for such a computation.

 

December 2006:

During finals week, with winter break approaching, BigNumber was used to compute 116 million digits of Euler's Constant on my laptop for what appeared to be a new world record. The computation ran for 38.5 hours and the verification ran for 48 hours. It required 1.8 GB of memory.

 

Early 2007:

Lots of media attention... As well as a lot of hate mail saying that it was not a world record. (S. Kondo and S. Pagliarulo already had 2 billion digits, but they hadn't announced it.)

During this time I also made a number of minor improvements to BigNumber, though all work had pretty much halted by April because of the release of a number of new video games.

 

Sometime between November - December 2007:

In the middle of one of my boring lecture classes - Lightbulb!!! The Hybrid NTT algorithm for multiplication was born. This effectively renewed my interest in this area.

 

2008:

BigNumber was rewritten from scratch in C++ and renamed y-cruncher.

("y" is gamma, the symbol for the Euler-Mascheroni Constant - but I still pronounce it as "y")

 


 

January 2009 (back from winter break):

With Nagisa back up and running, Raymond and I managed to break the world records for Log(2) and the Euler-Mascheroni Constant. (Main Article)

And with that, we released the first public version of y-cruncher.

 

By the end of the month, we had also taken the world records for Apery's Constant and Catalan's Constant.

No celebration though, since neither of us could legally drink yet...

 

 

Current:

Currently, y-cruncher is just a side hobby. I no longer work on it as much as I did in 2008 - not by a long shot.

Gaming and school-work now take priority.

 

The build numbers were started when the rewrite began back in January 2008. During the 9 months of active development before the first public release, there were 6000 builds. But during the 6 active months of development from January to October 2009, there were fewer than 1700 builds. (Again, no work was done over the summer because of an internship.)

 

Features:

Aside from computing π and other constants, y-cruncher is great for stress testing 64-bit systems with lots of ram.

Download:

Known Issues
(as of current release)

Version History:

Main Page: y-cruncher - Version History

 

Algorithms:

If you're interested in what formulas and algorithms y-cruncher uses:

Main Page: y-cruncher - Language and Algorithms

 

Current Release: Version 0.4.4 Build 7762b (fix 2)

Version 0.4.4 adds a specially optimized binary for AMD K10 processors. (Credit to Raymond Chan.)
Other than that, v0.4.4 is speed consistent with v0.4.3.

y-cruncher v0.4.4.7762b (fix 2).zip (4.91 MB)
 

Note that you may need to install one of the following updates in order to run the program.
Microsoft C++ 2008 Redistributable Package (x86)
Microsoft C++ 2008 Redistributable Package (x64)


Click here for older versions.


Please do not link directly to the file downloads as there may be newer versions.
Just link to http://www.numberworld.org/y-cruncher/#Download instead. Thanks!

 

 

 

 

 

 

Performance:

y-cruncher is the first efficient and publicly available Pi-calculator that can sustain a near 100% cpu load on multi-core computers.
There are other multi-threaded Pi-programs that can achieve high cpu usage, but few of them can sustain it through an entire Pi computation.

 

Below is a typical CPU utilization graph of y-cruncher when computing 1 billion digits of Pi across 8 cores.

 

y-cruncher also uses less memory than most other Pi-programs. It is also able to multi-thread WITHOUT significantly increasing memory usage.

 

Benchmarks:

Comparison Chart: (Last updated: January 11, 2010)

 

All times in seconds.

All benchmarks were done using the fastest binary with the fastest achieved settings for the system they were run on.

Number of Digits | Core 2 Duo (Merom) 2.0 GHz | Core 2 Quad (8 MB cache) 2.4 GHz | Phenom II X4 3.2 GHz [1] | Core i7 2.67 GHz [2] | Core i7 4.0 GHz [3] | 2 x Xeon (Harpertown) 3.2 GHz | 2 x Xeon (Gainestown) 3.33 GHz [4]
Version | v0.4.3 | v0.4.4 | v0.4.4 | v0.4.3 | v0.4.3 | v0.4.3 | v0.4.3
1,000,000 | 1.085 | 0.752 | 0.544 | 0.439 | 0.306 | 0.456 | -
10,000,000 | 14.62 | 8.521 | 5.254 | 4.375 | 2.966 | 4.305 | -
100,000,000 | 248.1 | 84.58 | 65.86 | 50.22 | 34.41 | 38.10 | 25.10
1,000,000,000 | - | 1,183 | - | 696.5 | 478.6 | 468.2 | 322.0
10,000,000,000 | - | - | - | - | - | 6,291 | 4,481

[1] This was actually a 2.8 GHz Phenom II X3. It was unlocked to 4 cores and then overclocked to 3.2 GHz. Credit to Raymond Chan.

[2] Intel Turbo Boost Technology increases actual operating frequency to 2.8 GHz.

[3] Overclocked from 2.67 GHz. Actual operating frequency after Turbo Boost is 4.2 GHz.

[4] Intel Turbo Boost Technology increases actual operating frequency to 3.46 GHz. Credit to Shigeru Kondo.

Number of Digits | Core 2 Quad (6 MB cache) 2.66 GHz | Core i7 2.67 GHz [1] | Core i7 4.0 GHz [2] | 2 x Opteron (Shanghai) 3.34 GHz [3] | 2 x Xeon (Harpertown) 3.2 GHz | 2 x Xeon (Gainestown) 3.2 GHz [4]
Version | v0.4.1 | v0.4.2 | v0.4.2 | v0.4.2 | v0.4.2 | v0.4.1
1,000,000 | 0.918 | 0.536 | 0.366 | 0.617 | 0.716 | -
10,000,000 | 7.859 | 5.027 | 3.398 | 4.288 | 4.774 | -
100,000,000 | 103.1 | 62.58 | 42.07 | 42.31 | 41.56 | 28.14
1,000,000,000 | 1,360 | 844.6 | 574.4 | 552.9 | 520.2 | 365.2
10,000,000,000 | - | - | - | - | 6,999 | 4,961

[1] Intel Turbo Boost Technology increases actual operating frequency to 2.8 GHz.

[2] Overclocked from 2.67 GHz. Actual operating frequency after Turbo Boost is 4.2 GHz.

[3] Overclocked from 2.9 GHz. Credit to Hawkeye4077 from XtremeSystems.

[4] Credit to Shigeru Kondo. Possibly overclocked, but the submitter made no mention of the actual operating frequency.
There has been a report from someone (with identical processors and faster ram) that these timings are unattainable without overclocking.

 

 

Multi-core Scaling: How much faster is multi-threading?

Processor(s) | CPU Frequency* | Memory | Memory Frequency | Multi-Threading Benefit
Intel Core 2 Quad Q6600 @ 2.4 GHz | 2.4 GHz | 6 GB DDR2 | 800 MHz | 3.617 x
Intel Core i7 920 @ 2.67 GHz | 3.34 GHz (3.5 GHz Turbo Boost) | 12 GB DDR3 | 1336 MHz | 4.296 x
2 x Intel Xeon X5482 Harpertown @ 3.2 GHz | 3.2 GHz | 64 GB DDR2 | 800 MHz | 6.769 x

Processor(s) | CPU Frequency* | Memory | Memory Frequency | Multi-Threading Benefit
Intel Core i7 920 @ 2.67 GHz | 2.67 GHz (2.8 GHz Turbo Boost) | 12 GB DDR3 | 1066 MHz | 4.220 x
Intel Core 2 Quad Q9400 @ 2.66 GHz | 2.66 GHz | 8 GB DDR2 | 800 MHz | 3.397 x
Intel Core i7 920 @ 2.67 GHz | 3.34 GHz (3.5 GHz Turbo Boost) | 12 GB DDR3 | 1336 MHz | 4.203 x
2 x Intel Xeon X5482 Harpertown @ 3.2 GHz | 3.2 GHz | 64 GB DDR2 | 800 MHz | 6.976 x

Processor(s) | CPU Frequency* | Memory | Memory Frequency | Multi-Threading Benefit
Intel Core i7 920 @ 2.67 GHz | 3.2 GHz (3.36 GHz Turbo Boost) | 6 GB DDR3 | 1600 MHz | 4.180 x
2 x Intel Xeon X5482 Harpertown @ 3.2 GHz | 3.2 GHz | 64 GB DDR2 | 800 MHz | 7.023 x

*Note that CPU frequencies higher than the stock frequency imply overclocking.
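
To put the "Multi-Threading Benefit" figures in perspective, here is a minimal Python sketch that turns a benefit into a fraction of ideal linear scaling. It assumes the benefit is the single-threaded time divided by the multi-threaded time (the table does not define it), and the helper name parallel_efficiency is made up for illustration. The core counts are the physical cores of each system.

```python
def parallel_efficiency(benefit: float, cores: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 would be perfect scaling).

    Assumes `benefit` = single-threaded time / multi-threaded time.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return benefit / cores


# (system, multi-threading benefit taken from the first table, physical cores)
systems = [
    ("Core 2 Quad Q6600", 3.617, 4),
    ("2 x Xeon X5482", 6.769, 8),
]

for name, benefit, cores in systems:
    print(f"{name}: {parallel_efficiency(benefit, cores):.1%} of linear scaling")
```

Under that assumption the quad-core reaches about 90.4% of linear scaling and the dual-Xeon about 84.6%. The Core i7 920 rows are left out on purpose: it has 4 physical cores but 8 logical cores, so a benefit above 4 x is not directly comparable.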

 

 

Other Benchmarks:

 

Random Screenshots: (from my test machines)

Pi - 500 million digits (6 minutes, 40 seconds): 2.8 GHz Phenom II X3 720 Deneb (unlocked to 4 cores + overclocked to 3.2 GHz), 4 GB DDR3 1333 MHz (dual channel)
Pi - 1 billion digits (8 minutes): 2.67 GHz Core i7 920 Bloomfield (overclocked to 4.2 GHz), 12 GB DDR3 1200 MHz (triple channel)
Pi - 10 billion digits (1 hour, 45 minutes): Dual 3.2 GHz Quad-Core Xeon X5482 Harpertown, 64 GB DDR2 FB-DIMM 800 MHz (quad channel)

 

Fastest Times:

(Last updated: January 11, 2010)

 

As of September 2009, many (if not all) of the benchmarks in this section are also the world-record fastest times in their categories for any program.

All times in seconds.

 

Green indicates that the benchmark has been validated.

Red indicates that the benchmark was either not validated, or no validation was provided.

 

In the future, I may decide to allow only validated benchmarks on this list.

As of the current release, only Ram-Only Pi computations done using the Benchmark feature will be validated. However, starting from version 0.5.x (which is still in an early Alpha stage), all computations will have validation. This includes swap mode as well as all the other constants.

Desktop (Limit One Processor)
Digits Time Version Computer Credit
25,000,000 6.273 v0.4.4 x64 SSE4.1 Intel Core i7 950 @ 4.83 GHz - on Water 6 GB DDR3 rge @ XtremeSystems
50,000,000 13.533 v0.4.4 x64 SSE4.1 Intel Core i7 950 @ 4.83 GHz - on Water 6 GB DDR3 rge @ XtremeSystems
100,000,000 34.405 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.0 GHz (4.2 GHz Turbo Boost) - on Air 12 GB DDR3 Alexander Yee
36.629 v0.4.2 x64 SSE3 Intel Core i7 950 @ 4.76 GHz - on Water 6 GB DDR3 rge @ XtremeSystems
250,000,000 97.874 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.0 GHz (4.2 GHz Turbo Boost) - on Air 12 GB DDR3 Alexander Yee
111.488 v0.4.1 x64 SSE3 Intel Core i7 920 @ 4.19 GHz (4.4 GHz Turbo Boost) - on Water 3 GB DDR3 Aaron Gordon
500,000,000 202.881 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.09 GHz (4.29 GHz Turbo Boost) - on Air 6 GB DDR3 cheapseats @ XtremeSystems
1,000,000,000 449.062 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.09 GHz (4.29 GHz Turbo Boost) - on Air 6 GB DDR3 cheapseats @ XtremeSystems
2,500,000,000 1,346.26 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.0 GHz (4.2 GHz Turbo Boost) - on Air 12 GB DDR3 Alexander Yee
5,000,000,000 5,207.08 v0.4.1 x64 SSE3 Intel Core i7 920 @ 3.34 GHz (3.5 GHz Turbo Boost) - on Air 12 GB DDR3 Alexander Yee
10,000,000,000 13,328.9 v0.4.1 x64 SSE3 Intel Xeon X5482 @ 3.2 GHz 64 GB DDR2 Alexander Yee
25,000,000,000 - -   - - -
Desktop (Limit One Processor)
Digits Time Version Computer Credit
1M 1,048,576 0.248 v0.4.4 x64 SSE4.1 Intel Core i7 950 @ 4.83 GHz - on Water 6 GB DDR3 rge @ XtremeSystems
2M 2,097,152 0.598 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.4 GHz (4.62 GHz Turbo Boost) - on Air 12 GB DDR3 THERMAL-REACTOR @ Computer Forum
4M 4,194,304 1.193 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.4 GHz (4.62 GHz Turbo Boost) - on Air 12 GB DDR3 THERMAL-REACTOR @ Computer Forum
8M 8,388,608 2.352 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.4 GHz (4.62 GHz Turbo Boost) - on Air 12 GB DDR3 THERMAL-REACTOR @ Computer Forum
16M 16,777,216 4.611 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.4 GHz (4.62 GHz Turbo Boost) - on Air 12 GB DDR3 THERMAL-REACTOR @ Computer Forum
32M 33,554,432 8.656 v0.4.4 x64 SSE4.1 Intel Core i7 950 @ 4.83 GHz - on Water 6 GB DDR3 rge @ XtremeSystems
64M 67,108,864 20.425 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.4 GHz (4.62 GHz Turbo Boost) - on Air 12 GB DDR3 THERMAL-REACTOR @ Computer Forum
128M 134,217,728 45.596 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.4 GHz (4.62 GHz Turbo Boost) - on Air 12 GB DDR3 THERMAL-REACTOR @ Computer Forum
256M 268,435,456 98.593 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.4 GHz (4.62 GHz Turbo Boost) - on Air 12 GB DDR3 THERMAL-REACTOR @ Computer Forum
512M 536,870,912 226.908 v0.4.4 x64 SSE4.1 Intel Core i7 920 @ 4.0 GHz (4.2 GHz Turbo Boost) - on Water 6 GB DDR3 JET @ Computer Forum
1G 1,073,741,824 511.093 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.0 GHz (4.2 GHz Turbo Boost) - on Air 12 GB DDR3 Alexander Yee
2G 2,147,483,648 1129.22 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.0 GHz (4.2 GHz Turbo Boost) - on Air 12 GB DDR3 Alexander Yee
4G 4,294,967,296 - -   - - -

 

Any Computer (No Processor Limit)
Digits Time Version Computer Credit
25,000,000 5.848 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
50,000,000 11.538 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
100,000,000 24.095 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
250,000,000 65.055 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 6 GB DDR3 Dave Hunt
Movieman @ XtremeSystems
500,000,000 139.536 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 6 GB DDR3 Dave Hunt
Movieman @ XtremeSystems
1,000,000,000 306.688 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 6 GB DDR3 Dave Hunt
Movieman @ XtremeSystems
2,500,000,000 869.629 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
5,000,000,000 1,912.270 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
10,000,000,000 4,250.138 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
25,000,000,000 15,450.138 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
50,000,000,000 85,443.384 v0.4.4 x64 SSE3 4 x AMD Opteron 8356 @ 2.3 GHz 128 GB DDR2 WaiKin Wong + Rickie Chang
100,000,000,000 - -   - - -

Any Computer (No Processor Limit)
Digits Time Version Computer Credit
1M 1,048,576 0.248 v0.4.4 x64 SSE4.1 Intel Core i7 950 @ 4.83 GHz - on Water 6 GB DDR3 rge @ XtremeSystems
2M 2,097,152 0.598 v0.4.3 x64 SSE4.1 Intel Core i7 920 @ 4.4 GHz (4.62 GHz Turbo Boost) - on Air 12 GB DDR3 THERMAL-REACTOR @ Computer Forum
4M 4,194,304 1.067 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
8M 8,388,608 2.060 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
16M 16,777,216 3.949 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
32M 33,554,432 7.504 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
64M 67,108,864 15.827 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
128M 134,217,728 32.739 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
256M 268,435,456 69.308 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 6 GB DDR3 Dave Hunt
Movieman @ XtremeSystems
512M 536,870,912 149.683 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 6 GB DDR3 Dave Hunt
Movieman @ XtremeSystems
1G 1,073,741,824 328.764 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 6 GB DDR3 Dave Hunt
Movieman @ XtremeSystems
2G 2,147,483,648 731.146 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
4G 4,294,967,296 1,595.959 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
8G 8,589,934,592 3,689.989 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
16G 17,179,869,184 8,184.953 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
32G 34,359,738,368 24,047.321 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
64G 68,719,476,736 - -   - - -

*These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@northwestern.edu


FAQ:

Q:  Is there a Linux version?
A:  Hold On...

I've gotten about a billion requests so far (and they keep on coming). So I've finally taken the first steps to port to Linux. (A number of people have even offered to port the program to Linux themselves.)

The source code is pretty much ready to be compiled on Linux. All code that uses Windows-specific functions now has standard C or OpenMP* implementations.
All that's left is to get Linux and the Intel Compiler installed on one of my 64-bit machines to start testing...

No guarantees on when it will be ready though... as I have close to no experience with Linux.

*Just as a side note: OpenMP appears to have a tremendous amount of overhead compared to WinAPI. (25-million-digit Pi computations are 10% slower with OpenMP than with WinAPI.) This is probably due to the lack of explicit thread control in OpenMP. Therefore, I cannot guarantee that y-cruncher will run faster on Linux than on Windows.
 
 
Q:  Can you make a CUDA version?
A:  Not yet...

Here are the major reasons:

  1. GPUs currently have very poor double-precision floating-point (DP-FP) performance. y-cruncher relies heavily on DP-FP for its speed.

     

  2. GPUs are highly vectorized. y-cruncher isn't ready for massive scalable vectorization.

     

  3. CUDA currently does not support recursion. (There's a lot of multi-way recursion in y-cruncher. I'm not inclined to try rewriting them using loops.)

     

  4. y-cruncher's purpose is efficiency on large computations. GPUs simply don't have enough ram to do large computations locally.

    The bandwidth between GPU and main memory will probably be a huge bottleneck. y-cruncher is already somewhat bottlenecked by bandwidth on a CPU. On a GPU, it will be much more bottlenecked because the GPU has much more computational power and the (GPU <--> main memory) bandwidth is usually less than (CPU <--> main memory) bandwidth.

    This holds even for benchmarking. If y-cruncher were able to fully utilize a GPU, benchmarks would be extremely fast - so fast that the largest computation that could be done in ram (either GPU ram, or CPU ram - it doesn't matter) would likely be too short to be a worthwhile benchmark.

     

  5. There is currently no set-in-stone standard for GP-GPU programming.

Note that Nvidia's upcoming Fermi-based video cards will solve a number of these issues. But for now, I'll play it by ear.
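
To give a feel for reason 3, here is a small Python sketch of what rewriting a two-way recursion with loops involves. It is a generic illustration, not y-cruncher's code: the product-tree example and both function names are assumptions chosen for the demo. Even in this simple case the loop version has to manage the intermediate results by hand, and real multi-way recursions that carry more state are harder still.

```python
def product_recursive(values, lo, hi):
    """Product of values[lo:hi] (non-empty) by two-way divide and conquer."""
    if hi - lo == 1:
        return values[lo]
    mid = (lo + hi) // 2
    return product_recursive(values, lo, mid) * product_recursive(values, mid, hi)


def product_iterative(values):
    """Product of a non-empty list, bottom-up with loops instead of recursion."""
    level = list(values)
    while len(level) > 1:
        # Multiply neighbouring pairs to build the next level of the tree.
        next_level = [level[i] * level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An odd element has no partner, so carry it up unchanged.
            next_level.append(level[-1])
        level = next_level
    return level[0]


numbers = list(range(1, 11))
print(product_recursive(numbers, 0, len(numbers)) == product_iterative(numbers))
```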
 
Q:  How does y-cruncher compare to other programs?
A:  On single-core machines, y-cruncher is not the fastest. But on dual-core (or more) machines I know of no other program that can decisively beat y-cruncher.

However, I will NOT claim that y-cruncher is the fastest program for computing Pi.

Below is a table of the six fastest (publicly available) programs and how y-cruncher compares to them.

Program Author(s) Description + Environments where it beats y-cruncher
TachusPI Fabrice Bellard
  • Holder of the current world record for the most digits of Pi computed on both supercomputer and desktop.
  • Although the Windows version appears to be broken, the Linux version appears to be faster than y-cruncher at least on Core i7.
  • More details to come...
Parallel GMP-Chudnovsky David Carver + Hanhong Xue + GMP team
  • This is a parallelized version of GMP-Chudnovsky using OpenMP. It appeared back in October 2008 and was improved a month later. It runs much faster on AMD processors than Intel processors. Below a few million digits, this is the fastest Pi-program, period. But because of its use of GMP, the true speed of this program cannot be achieved in Windows due to the lack of assembly support.
  • On AMD K10 (in Linux), the x64 version appears to beat y-cruncher (in Windows) for:
    • All computations below a million digits.
    • All single-threaded computations.
    • Dual-thread computations below a few million digits.
  • For larger computations with 4 or more cores, y-cruncher is still faster.
  • Although Parallel GMP-Chudnovsky is multi-threaded, it does not scale as well as y-cruncher. So even though it beats y-cruncher in clock-for-clock linear speed, it is slower when there are more than 2 cores.
QuickPi 4.5 Steve Pagliarulo
  • QuickPi is multi-threaded and supports x64 and SSE3.
  • Clock-for-clock, QuickPi 4.5 is faster than y-cruncher. Therefore it beats y-cruncher for single-threaded computations of less than a billion digits or so.
  • Like Parallel GMP-Chudnovsky, QuickPi 4.5 has trouble scaling up with cores. So for multi-threaded computations with 2 or more cores, y-cruncher is usually faster.
MaxxPi-Multi M. Bicak
  • MaxxPi-Multi is a relatively new program that is aimed at benchmarking. Although its purpose is not speed, it is nevertheless one of the fastest in the world. It supports SSE, multi-threading, and is the only "fast" program for computing Pi that has a GUI.
  • Clock-for-clock, MaxxPi-Multi is actually the only program in this table that is slower than y-cruncher. However, it scales decently well for the first few cores - enough to beat out GMP-Chudnovsky and PiFast 4.3 on quad-core.
  • Because MaxxPi-Multi is slower clock-for-clock, y-cruncher seems to beat it for all large computations regardless of the number of cores/threads.
GMP-Chudnovsky Hanhong Xue + GMP team
  • This is the original (single-threaded) version of GMP-Chudnovsky. It runs much faster on AMD processors than Intel processors.
  • Clock-for-clock, GMP-Chudnovsky is faster than y-cruncher. Therefore it beats y-cruncher for all single-threaded computations.
  • For multi-threaded computations with 2 or more cores, y-cruncher is still faster.
PiFast 4.3 Xavier Gourdon
  • PiFast - An old classic. It undisputedly held the title of "Fastest Program to Compute Pi" for quite a while until QuickPi passed it. Using only x86 and x87 FPU instructions, PiFast packs a very impressive speed. It is also one of the most memory efficient programs for computing Pi.
  • Clock-for-clock, PiFast 4.3 is virtually tied with y-cruncher 0.4.3 (x86). However, it is more efficient for larger computations. The cross-over point between PiFast 4.3 and single-threaded y-cruncher 0.4.3 (x86) is about 10 million digits on Core i7. Below that, y-cruncher is slightly faster. And above that, PiFast 4.3 is slightly faster.
  • Because of the virtual tie, any advantage for y-cruncher will tip the balance. Therefore any of (SSE3, x64, multi-threading) will make y-cruncher faster.

Just to clear up a few things: y-cruncher is intended to be fast, but not optimal. It is optimized for memory efficiency on large computations.
Utmost speed is not the priority: y-cruncher could probably be made 10 - 30% faster by relaxing memory constraints and using decimal arithmetic.
 
 
Q:  Why does y-cruncher run 4 threads on my 3-core system (8 threads on 6-core, etc...)?
A:  This is due to practical restrictions in the algorithms that are used by y-cruncher. Because of the nature of these algorithms, they are most efficiently parallelized when the thread count is a power of two. To deal with systems that don't have a power-of-two number of logical cores, y-cruncher simply rounds up to the next power of two.

The overhead of running extra threads is usually very small. Any load balancing issues that result from awkward thread-to-core ratios are usually resolved by further increasing the thread count. (as explained in the next Q/A)
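
The rounding rule and the load-balancing claim can be sketched in a few lines of Python. This is an illustration under stated assumptions, not y-cruncher's scheduler: both function names are hypothetical, the work is assumed to split into equal tasks, and each core is assumed to run one task to completion at a time (a real operating system time-slices threads, so actual behavior will differ).

```python
import math


def round_up_to_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n, e.g. 3 -> 4 and 6 -> 8."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def makespan(num_tasks: int, cores: int, total_work: float = 1.0) -> float:
    """Finish time when total_work is split into num_tasks equal tasks.

    Simplified model: tasks run in rounds of `cores` tasks each, and a core
    that finishes early waits for the next round.
    """
    rounds = math.ceil(num_tasks / cores)
    return rounds * (total_work / num_tasks)


cores = 3
threads = round_up_to_power_of_two(cores)
print(f"{cores} cores -> {threads} threads")
for tasks in (threads, 4 * threads, 16 * threads):
    print(f"{tasks:3d} tasks: finish time {makespan(tasks, cores):.4f}")
print(f"ideal (perfect balance): {1.0 / cores:.4f}")
```

In this model, 4 tasks on 3 cores finish at 0.5 because the fourth task runs alone in a second round, while 16 tasks finish at 0.375, closer to the ideal of about 0.333. That is the sense in which increasing the thread count smooths out an awkward thread-to-core ratio.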
 
Q:  Why does y-cruncher create more threads than I tell it to use? Because of this I can't get dual-core benchmarks on a quad core machine since it will use all 4 cores even in dual-core mode.
A:  This is by design and is NOT a bug. Because of the nature of some of the algorithms, I find it necessary to spam threads in order to maximize multi-core efficiency. The work-around is to go to "Processor Affinity" in Task Manager and uncheck the cores that you do not want y-cruncher to use. y-cruncher does not do this automatically because it "doesn't know which logical cores are the best to use".

I call this method "Thread Spamming". Yes, it sounds ridiculous. But it's a very simple and effective way to deal with load imbalance.
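
The checkboxes under "Processor Affinity" correspond to bits in an affinity mask, one bit per logical core. The Python helper below (a hypothetical name, written only for illustration) shows how such a mask is built from a list of core indices. Which core indices are "the best to use" on a given machine is not something this sketch can decide, for the same reason y-cruncher doesn't.

```python
def affinity_mask(core_indices):
    """Bitmask with bit i set for every logical core index i in core_indices."""
    mask = 0
    for index in core_indices:
        if index < 0:
            raise ValueError("core indices must be non-negative")
        mask |= 1 << index
    return mask


# Restrict a process to logical cores 0 and 1 for a "dual-core" run.
mask = affinity_mask([0, 1])
print(f"decimal {mask}, hex {mask:X}, binary {mask:04b}")
```

For cores 0 and 1 the mask is 3 (binary 0011). Some Windows versions also accept such a mask in hexadecimal through the `start /affinity` command; treat that as something to check with `start /?` on your own system rather than a given.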
 
Q:  Is y-cruncher open-sourced?
A:  No.
 
Q:  Is there a publicly available static library for the multi-threaded arithmetic that y-cruncher uses?
A:  No. At least not now...

y-cruncher's arithmetic module is indeed isolated from the rest of the program in its own library. I call it "YMP" (y-cruncher Multi-Precision Arithmetic Library), but it is also closed-source.

Currently, y-cruncher is the only thing that uses YMP in its entirety. But there is a growing interest in using y-cruncher's FFT in some signal processing and optical-related work since it is significantly faster than FFTW in a number of performance critical applications.
 
Q:  Why are the version numbers so low? Is there going to be something big coming?
A:  Yes. y-cruncher will not reach version 1.0.0 before Advanced Swap Mode is completed. And that won't be for a while.
 
Q:  Who are you? Are you really still in college? What degrees do you have? etc...
A:  Yes I'm still in college. As of spring 2009, I am 21 years old and a Junior undergraduate student at Northwestern University just north of Chicago, Illinois.
Therefore, I don't even have a college diploma yet - let alone a masters or Ph.D... So I apologize if the tone of writing throughout this website is that of a restless college student.
I am a computer enthusiast and a semi-die hard gamer. Outside of computers, my hobbies include bowling, piano, and Japanese Anime.
And lastly, no I don't speak Japanese (as much as I'd like to). Aside from English, I speak Cantonese and a tiny bit of Mandarin.

Links:

Here are some interesting sites dedicated to the computation of Pi and other constants:

Special Thanks

Questions or Comments

Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.