y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

It Stands at 10 trillion digits of Pi...
World Record for both Desktop and Supercomputer!!!

 

(Last updated: December 21, 2011)

 

Shortcuts:

 

The first scalable multi-threaded Pi-benchmark for multi-core systems...

 

Against the Big Guns...

Faster than SuperPi on single-core...

Faster than PiFast 4.3 on dual-core...

Faster than QuickPi 4.5 on quad-core...

 

1 billion digits of Pi in 5 minutes on 6-core Core i7 @ 4.26 GHz

 

See the official XtremeSystems thread for more benchmarks.

 

Latest Version:

Windows: Version 0.5.5 Build 9180 (fix 2) (Released: April 6, 2011)

Linux      : Version 0.5.5 Build 9187 (fix 2) (Released: February 20, 2011)

 

Starting from v0.5.2, y-cruncher allows Pi computations of up to 10 trillion digits.

y-cruncher has been tested up to the current limit 10 trillion digits. (Credit: Shigeru Kondo)

 

Due to algorithmic limitations, v0.5.x is not capable of going above 10 trillion digits. (The exact limit is unknown, but is somewhere between 10 - 40 trillion...)

This will be solved in v0.6.1 with the addition of a new multiplication algorithm.

 

Records Set by y-cruncher:

World Record Size Computations

Date Announced Date Completed: Source: Who: Constant: Decimal Digits: Time: Computer:
October 17, 2011 October 16, 2011 Source Shigeru Kondo &
Alexander Yee
Pi 10,000,000,000,050

Compute:  371 days

Verify:  45 hours

2 x Intel Xeon X5680 @ 3.33 GHz
96 GB DDR3 @ 1066 MHz
24 x 2 TB
May 24, 2011 May 20, 2011   Shigeru Kondo Log(10) 100,000,000,000

Compute and Verify:
142 hours

Intel Core i7 2500K @ 4.8 GHz
16 GB DDR3 + 6 x 1 TB
May 14, 2011   Shigeru Kondo Log(2) 100,000,000,000

Compute and Verify:
142 hours

Intel Core i7 2500K @ 4.8 GHz
16 GB DDR3 + 6 x 1 TB
December 13, 2010 December 12, 2010   Alex Roberts Catalan's Constant 50,000,000,000

Compute:  35.5 days

Not Verified

AMD Phenom II X4 945 @ 3.0 GHz
8 GB
December 4, 2010 November 26, 2010   Alex Roberts Log(2) 50,000,000,000

Compute:  13.1 days

Independently Verified

Intel Core i7 720Q @ 1.60 GHz
4 GB DDR3
September 17, 2010 September 17, 2010 Source Alexander Yee Zeta(3) - Apery's Constant 100,000,001,000

Compute:  148 hours

Verify:  366 hours

"Nagisa" + "Ushio"
August 2, 2010 August 2, 2010 Source Shigeru Kondo &
Alexander Yee
Pi 5,000,000,000,000

Compute:  90 days

Verify:  64 hours

2 x Intel Xeon X5680 @ 3.33 GHz
96 GB DDR3 @ 1066 MHz
16 x 2 TB
July 8, 2010 July 8, 2010 Source Alexander Yee Golden Ratio 1,000,000,000,000

Compute:  114 hours

Verify:  ~7 days*

*Not a continuous run.

"Nagisa"
2 x Intel Xeon X5482 @ 3.2 GHz
64 GB DDR2 FB-DIMM
1.5 TB (Boot + Output)
4 x 1 TB (2 x 2 RAID0) + 6 x 2 TB
July 5, 2010 July 5, 2010 Source Shigeru Kondo e 1,000,000,000,000

Compute: 224 hours

Verify: 219 hours

Intel Core i7 980X @ 3.33 GHz
12 GB DDR3
2 TB (Boot + Output)
8 x 1 TB (Computation)
March 22, 2010 March 22, 2010 Source Shigeru Kondo Square Root of 2 1,000,000,000,000

Compute:  193 hours

Verify:  98 hours

Core i7 975 @ 4 GHz - 12GB
8 x 1 TB HDs
2 x Xeon W5590 - 144GB
16 x 2 TB HDs
February 21, 2010 February 20, 2010 Source Alexander Yee e 500,000,000,000

Compute and Verify:
307 hours

"Ushio"
April 16, 2009 April 16, 2009 Source Alexander Yee &
Raymond Chan
Catalan's Constant 31,026,000,000

Compute:  178 hours

Verify:  221 hours

"Nagisa"
March 13, 2009 March 13, 2009 Source Alexander Yee &
Raymond Chan
Euler-Mascheroni Constant 29,844,489,545

Compute:  205 hours

Verify:  269 hours

"Nagisa"
February 28, 2009 Source Alexander Yee &
Raymond Chan
Log(10) 31,026,000,000

Compute and Verify:
40 hours

"Nagisa"
February 15, 2009 Source Alexander Yee &
Raymond Chan
Zeta(3) - Apery's Constant 31,026,000,000

Compute:  45 hours

Verify:  44 hours

"Nagisa"
February 4, 2009 Source Alexander Yee &
Raymond Chan
Log(2) 31,026,000,000

Compute:  24 hours

Verify:  16 hours

"Nagisa"
Janurary 31, 2009 January 31, 2009 Source Alexander Yee &
Raymond Chan
Catalan's Constant 15,510,000,000

Compute:  88 hours

Verify:  100 hours

"Nagisa"
Janurary 21, 2009 January 21, 2009 Source Alexander Yee &
Raymond Chan
Zeta(3) - Apery's Constant 15,510,000,000

Compute:  20 hours

Verify:  21 hours

"Nagisa"
Janurary 18, 2009 January 18, 2009 Source Alexander Yee &
Raymond Chan
Euler-Mascheroni Constant 14,922,244,771

Compute:  96 hours

Verify:  134 hours

"Nagisa"
January 7, 2009 Source Alexander Yee &
Raymond Chan
Log(2) 15,500,000,000

Compute:  12.5 hours

Verify:  8.3 hours

"Nagisa"

Note that starting from v0.5.2, the computation limits of the program are no longer locked below the current world records. So barring any bugs, anyone with sufficient resources will be able to break these records.


The Storyline:

2005 - 2006:

The roots of y-cruncher date all the way back to my senior year in high school in my AP Computer Science class.

It started from a class project which was to write a multi-precision arithmetic library in Java that would support addition, subtraction, multiplication, and division.
After the assignment was due, I continued working on the library and named it "BigNumber". Some of the new features that were added were square roots, trig-functions, constants, etc...

 

June - October 2006:

After graduation, I began to take speed seriously. Multiplication was completely rewritten in C and linked back to BigNumber using JNI. This was around the time that I began to realize that parts of "BigNumber" were fairly fast - comparable to Mathematica. In particular, the function for computing the Euler-Mascheroni Constant was faster than that of Mathematica 5. By October, it came to realization that the world record of 108 million digits for the Euler-Mascheroni Constant was in reach.

 

November 2006:

With the goal of breaking the world record of 108 million digits for Euler's Constant in mind, November was spent entirely on implementing and optimizing the algorithms needed for extremely high precision arithmetic. I also upgraded my laptop from 512MB to 1.5GB of ram as that would be the computer that I would use for such a computation.

 

December 2006:

Finals week and with winter break approching, BigNumber was used to compute 116 million digits of Euler's Constant on my laptop for what appeared to be a new world record. The computation ran for 38.5 hours and the verification ran for 48 hours. It required 1.8 GB of memory.

 

Early 2007:

Lots of media attention... As well as a lot of hate mail saying that it was not a world record. (S. Kondo and S. Pagliarulo already had 2 billion digits, but they hadn't announced it.)

During this time I also made a number of minor improvements to BigNumber. Though all work was pretty much halted by April because of the release of a number of new video games.

 

Sometime between November - December 2007:

In the middle of one of my boring lecture classes - Lightbulb!!! The Hybrid NTT algorithm for multiplication was born. This effectively renewed my interest in this area.

 

2008:

BigNumber was rewritten from scratch in C++ and renamed y-cruncher.

("y" is gamma, the symbol for the Euler-Mascheroni Constant - but I still pronounce it as "y")

 

Click to expand this section. (Warning: technical terminology)

 

January 2009 (back from winter break):

With Nagisa back up and running, Raymond and I managed to break the world records for Log(2) and the Euler-Mascheroni Constant. (Main Article)

And with that, we released the first public version of y-cruncher.

 

By the end of the month, we had also taken the world records for Apery's Constant and Catalan's Constant.

No celebration though, since neither of us could legally drink yet...

 

 

Current:

Currently, y-cruncher is just a mere side-hobby. I no longer work on it as much as I did in 2008 - not even close by a long-shot.

Gaming and school-work now take priority.

 

During the months prior to the start of the Pi computation record of 5 trillion digits, I worked on the program quite a bit. But now that it is done and over with, I'm back to my "on and off" pattern of working on the program.

 

Features:

Aside from computing π and other constants, y-cruncher is great for stress testing 64-bit systems with lots of ram.

Download:

Known Issues: (as of current release)

Version History:

Main Page: y-cruncher - Version History

 

Algorithms:

If you're interested in what formulas and algorithms y-cruncher uses:

Main Page: y-cruncher - Language and Algorithms

 




Windows: y-cruncher v0.5.5.9180 (fix 2).zip (5.83 MB)
Linux      : y-cruncher v0.5.5.9187 (fix 2).tar.gz (5.34 MB)
 

System Requirements:
Click here for older versions.


Please do not link directly to the file downloads as there may be newer versions.
Just link to http://www.numberworld.org/y-cruncher/#Download instead. Thanks!

 

 

 

 

 

 

 

 

Performance:

y-cruncher is the first efficient and publicly available Pi-calculator that can sustain a near 100% cpu load on multi-core computers.
There are other multi-threaded Pi-programs that can achieve high cpu usage, but few of them can sustain it through an entire Pi computation.

 

Below is a typical CPU utilization graph of y-cruncher when computing 1 billion digits of Pi across 8 cores.

 

y-cruncher uses less memory than most other Pi-programs. It is also able to multi-thread WITHOUT significantly increasing memory usage.

 

Benchmarks:

Comparison Chart: (Last updated: February 5, 2011)

 

All times in seconds. All times include the time needed to convert the digits to decimal representation.

All benchmarks were done using the fastest binary with the fastest achieved settings for the system they were run on.

v0.5.3 and v0.5.4 are exactly the same speed. So results are directly comparable. v0.5.5 is faster on processors with AVX instructions.

Processor(s): Core 2 Q6600
2.4 GHz
Phenom II X4 940
3.5 GHz
1
Core i7 920
2.80 GHz
2
Core i7 920
4.2 GHz
3
Core i7 980X
4.48 GHz
4
Core i7 2600K
4.8 GHz
5
4 x Opteron
(Barcelona)
2.31 GHz
6
2 x Xeon X5482
(Harpertown)
3.2 GHz
2 x Xeon X5680
(Westmere-EP)
3.46 GHz
7
Cores/Threads: 4/4 4/4 4/8 4/8 6/12 4/8 16/16 8/8 12/24
Number of Digits v0.5.3 v0.5.3 v0.5.4 v0.5.4 v0.5.4 v0.5.5 v0.5.3 v0.5.5 v0.5.3
1,000,000 0.566   0.390 0.259   0.216   0.354  
10,000,000 5.286   3.667 2.466   1.916   3.147  
100,000,000 68.95 48.55 43.60 29.53 22.51 21.54 35.09 29.43 16.29
1,000,000,000 990.0 698.8 619.4 424.3 302.8 311.8 468.1 392.5 202.5
10,000,000,000               5,365 2,721

1Overclocked from 3.0 GHz. Credit to CRFX from XtremeSystems.

2Base frequency is 2.67 GHz. Intel Turbo Boost Technology increases actual operating frequency to 2.8 GHz.

3Overclocked from 2.67 GHz to 4.0 GHz. Actual operating frequency after Turbo Boost is 4.2 GHz.

4Overclocked from 3.33 GHz. Credit to tet5uo from XtremeSystems.

5Base frequency is 3.4 GHz with 3.5 GHz 4-core Turbo Boost. Actual operating frequency is 4.8 GHz by overclocking. Credit to Shigeru Kondo for the 100m and 1b runs. (I haven't had the time to play with my overclock enough to get 4.8 GHz benchable for longer runs.)

6Credit to skycrane from XtremeSystems.

7Base frequency is 3.33 GHz. Intel Turbo Boost Technology increases actual operating frequency to 3.46 GHz. Credit to Shigeru Kondo.

 

 

 

Random Screenshots: (from my test machines)

Pi - 1 billion digits (7 minutes) Pi - 1 billion digits (5 minutes, 33 seconds) Pi - 100 billion digits (28 hours)
Windows 7 - y-cruncher v0.5.4 x64 SSE4.1 Ubuntu Linux 10 - y-cruncher v0.5.5 x64 AVX Windows 7 - y-cruncher v0.5.4 x64 SSE4.1
Core i7 920 @ 4.2 GHz (overclock) Core i7 2600K @ 4.4 GHz (overclock) 2 x Xeon X5482 Harpertown @ 3.2 GHz
12 GB DDR3 @ 1200 MHz 16 GB DDR3 @ 1333 MHz 64 GB DDR2 FB-DIMM @ 800 MHz
    8 x 2 TB Hitachi Deskstar

 

Fastest Times:

(Last updated: September 22, 2011)

 

All times in seconds.

 

Green indicates that the benchmark has been validated.

Red indicates that the benchmark was either not validated, or no validation was provided.

 

In the future, I may decide to allow only validated benchmarks on this list.

As of the current release, only Ram-Only Pi computations done using the Benchmark feature will be validated. However, starting from version 0.5.2, all computations have validation. This includes both swap modes as well as all the other constants.

 

A full chart of rankings for each size can be found here:

Desktop (Limit One Processor)
Digits Time Version Computer Credit
25,000,000 4.278 v0.5.5 x64 AVX Intel Core i7 3930K @ 4.29 GHz 64 GB DDR3 Justin Bronder
50,000,000 8.159 v0.5.5 x64 AVX Intel Core i7 3930K @ 4.29 GHz 64 GB DDR3 Justin Bronder
100,000,000 17.428 v0.5.5 x64 AVX Intel Core i7 3930K @ 4.29 GHz 64 GB DDR3 Justin Bronder
250,000,000 49.033 v0.5.5 x64 AVX Intel Core i7 3930K @ 4.29 GHz 64 GB DDR3 Justin Bronder
500,000,000 105.520 v0.5.5 x64 AVX Intel Core i7 3930K @ 4.20 GHz 32 GB DDR3 Rod Laird
1,000,000,000 233.251 v0.5.5 x64 AVX Intel Core i7 3930K @ 4.20 GHz 32 GB DDR3 Rod Laird
2,500,000,000 899.606 v0.5.4 x64 SSE4.1 Intel Core i7 2600K @ 5.00 GHz 16 GB DDR3 Shigeru Kondo
5,000,000,000 2,145.75 v0.5.4 x64 SSE4.1 Intel Core i7 980X @ 4.00 GHz - on Air 24 GB DDR3 alpha754293 @ XtremeSystems
10,000,000,000 1.765 hours v0.5.4 x64 SSE4.1 Intel Core i7 2600K @ 5.00 GHz
Hard Drives: 4 x ?
16 GB DDR3 Shigeru Kondo
25,000,000,000 6.874 hours v0.5.4 x64 SSE4.1 Intel Core i7 2600K @ 4.50 GHz - on Water
Hard Drives: 1.5 TB + 4 x 1 TB
16 GB DDR3 Alexander Yee
50,000,000,000 22.343 hours v0.5.2 x64 SSE4.1 Intel Core i7 920 @ 3.34 GHz (3.5 GHz Turbo Boost) - on Air
Hard Drives: 4 x 2 TB
12 GB DDR3 Alexander Yee
100,000,000,000 30.984 hours v0.5.5 x64 AVX Intel Core i7 2600K @ 4.40 GHz - on Water
Hard Drives: 1.5 TB + 5 x 1 TB + 5 x 2 TB
16 GB DDR3 Alexander Yee
250,000,000,000 123.982 hours v0.5.5 x64 AVX Intel Core i7 2600K @ 4.40 GHz - on Water
Hard Drives: 1.5 TB + 4 x 1 TB
16 GB DDR3 Alexander Yee
500,000,000,000 - - - - - -
Any Computer (No Processor Limit)
Digits Time Version Computer Credit
25,000,000 3.849 v0.5.3 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
50,000,000 7.585 v0.5.3 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
100,000,000 14.512 v0.5.3 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
250,000,000 38.582 v0.5.3 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
500,000,000 79.311 v0.4.4 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
1,000,000,000 174.470 v0.4.4 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
2,500,000,000 552.673 v0.5.2 x64 SSE4.1 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) 128 GB DDR3 Daniel Ghidali
5,000,000,000 1,143.750 v0.5.2 x64 SSE4.1 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) 128 GB DDR3 Daniel Ghidali
10,000,000,000 2,121.99 v0.5.5 x64 SSE3 4 x AMD Opteron 6168 64 GB DDR3 Sheik @ XtremeSystems
25,000,000,000 1.720 hours v0.5.3 x64 SSE4.1 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) 128 GB DDR3 Daniel Ghidali
50,000,000,000 14.807 hours v0.5.3 x64 SSE4.1 2 x Intel Xeon X5482 @ 3.2 GHz
Hard Drives: 1.5 TB + 4 x 1 TB
64 GB DDR2 Alexander Yee
100,000,000,000 17.283 hours v0.5.3 x64 SSE4.1 2 x Intel Xeon X5650 @ 2.66 GHz
Hard Drives: 16 x 2 TB
144 GB DDR3 Shigeru Kondo
250,000,000,000 83.586 hours v0.5.2 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz
Hard Drives: 8 x 2 TB
144 GB DDR3 Shigeru Kondo
500,000,000,000 172.396 hours v0.5.2 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz
Hard Drives: 16 x 2 TB
144 GB DDR3 Shigeru Kondo
1,000,000,000,000 12.260 days v0.5.3 x64 SSE4.1 2 x Intel Xeon X5650 @ 2.66 GHz
Hard Drives: 16 x 2 TB
96 GB DDR3 Shigeru Kondo
2,500,000,000,000            
5,000,000,000,000 90 days v0.5.4 x64 SSE4.1 2 x Intel Xeon X5680 @ 3.33 GHz
Hard Drives: 16 x 2 TB
96 GB DDR3 Shigeru Kondo
10,000,000,000,000 371 days v0.5.5 x64 SSE4.1 2 x Intel Xeon X5680 @ 3.33 GHz
Hard Drives: 24 x 2 TB
96 GB DDR3 Shigeru Kondo

*These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@u.northwestern.edu

FAQ:

Q:  Why does AVX (v0.5.5) only give about 10% speedup over SSE4.1 (v0.5.4)? Shouldn't it be double the speed?
A:  Unlike the majority of compute-intensive applications, y-cruncher does not exclusively use floating-point. As of v0.5.4, only about 30% of a Pi computation is floating-point bound. The remainder of the time is spent on integer operations and stalling on memory access. So cutting that 30% in half yields little overall speedup. Speeding up the code in this manner exposes more memory bottlenecks - which ends up reducing the speedup to only 10%...

Integer operations can be largely be emulated using floating-point (albeit with overhead). But most of the integer work involves carry-propagation, so it is not very vectorizable. For now, integer operations are still faster using the normal integer instructions.

Even without AVX, floating-point is not the dominant factor in performance.
Plans for v0.6.x include improving the integer operations to better utilize x64 capabilities. Memory optimizations are also slated for the future.

Mathematical improvements will be added whenever convenient. These will improve the computational speed of y-cruncher, but will probably have no effect on the resource consumption ratios in y-cruncher. (By "resource consumption ratios" I mean the relative portions of the program with are bound by integer/floating-point/memory.)

 

 

 

Q:  Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!
A:  The reason is simple. For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself!!! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done. As of 2010, I am not aware of any Pi-program that achieves perfect parallelism for small computations and is at least half the speed of y-cruncher. (It's easy to get perfect parallelism if you artificially make the task really slow.)

Now here's a layman explanation:

Suppose you had a 1000 page novel that needs to be proof-read. Now suppose you had 10 staff members. To speed up the process, you can assign 100 pages to each staff member. This basically makes the task 10x faster. This is an example of a "large computation" done using a "small" number of cores.

Now suppose you need to proof-read a 1 page paper. And you have a staff of 100 people. Are you going to tear the page into 100 small parts and give one to each person to proof-read? Furthermore, organizing your group of 100 staff members is probably going to take longer than proof reading the entire page yourself!!! This is an example of a "small computation" using a "large" number of cores.

Of course, this is a highly simplified explanation of what is really going on. But the overall idea is the same.
It should be noted that some tasks are easier to parallelize than others. Most of the multi-threaded benchmarks that are used in the overclocking community tend to be highly "synthetic" tasks that are extremely easy to parallelize. These "synthetic" tasks typically achieve perfect parallelism regardless of the size of the task. But most other applications that do any sort of "meaningful" task tend to be less ideal. (And no, I'm not trying to imply that computing Pi is in anyway meaningful.)

 

 

Q:  Who else helped you develop this program?
A:  Although all the development and testing (including all the coding) was done by myself, I have to give a lot of credit to my parents for providing me the resources to acquire the hardware that I've needed.

The total cost of all the hardware that were directly involved in the development of y-cruncher adds up to more than $10,000. Most of this came from my parents. So yes, y-cruncher was developed entirely using my own hardware. (With small amounts of testing on some of my friends' hardware.)

Since then, I've gained back much more than 10-grand in the form of scholarships and job opportunities. (So it can't really be quantified...)
But that's not the point. The y-cruncher project started out as just a really expensive hobby from which we were expecting no return.

It just happened to turn out a little better than expected...

 

Q:  Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?
A:  Yes there is, but it's still in a very alpha stage and is far from ready for release.

 

 
Q:  Is there a distributed version that performs better on NUMA and HPC clusters?
A:  Not yet. Pretty much the entire program needs to be redesigned for NUMA and MPI.

For now, a speedup can be gained in Linux by running y-cruncher with interleaved memory: numactl --interleave=all "./x64 SSE3.out"
 
 
Q:  Can you make a CUDA version?
A:  Definitely not for CUDA 1.x. Things are better for CUDA 2.0, but probably still not good enough.

Here are the major reasons:

  1. GPUs are highly vectorized. y-cruncher isn't ready for massive scalable vectorization. The widening of the SIMD vector from 128-bit SSE to 256-bit AVX only produced 10% speedup. Therefore, going to larger vectors with the current set of algorithms will be well into the region of diminishing return. New algorithms and programming techniques will need to be developed to make such a task efficient on GPUs.

     

  2. The memory bandwidth between the GPU and main memory will almost certainly be a bottleneck. This isn't a problem for small computations that will fit entirely into GPU memory, but small computations isn't the point of y-cruncher.

     

  3. There is currently no set-in-stone standard for GP-GPU programming. CUDA is NVidia only. Most other GPU programming standards are, for the most part, too graphics oriented for such a non-graphics task as computing Pi.
 
Q:  How does y-cruncher compare to other programs?
A:  Here's a graph comparing y-cruncher to the two fastest publicly-available Pi programs that I know of. (click to enlarge)

(Huge thanks to Fredrik Kjølstad for compiling GMP for me on my machines!)
Q:  Why does y-cruncher run 4 threads on my 3-core system (8 threads on 6-core, etc...)
A:  This is due to practical restrictions in the algorithms that are used by y-cruncher. Because of the nature of the algorithms that y-cruncher uses, they are most efficiently paralleled when the thread count is a power of two. To deal with systems that don't have a power-of-two number of logical cores, y-cruncher simply rounds up to the next power of two.

The overhead of running extra threads is usually very small. Any load balancing issues that result from awkward thread-to-core ratios are usually resolved by further increasing the thread count. (as explained in the next Q/A)
 
Q:  Why does y-cruncher create more threads than I tell it to use? Because of this I can't get dual-core benchmarks on a quad core machine since it will use all 4 cores even in dual-core mode.
A:  This is by design and is NOT a bug. Because of the nature of some of the algorithms, I find it necessary to spam threads in order to maximize multi-core efficiency. The work-around is to go to "Processor Affinity" in Task Manager and uncheck the cores that you do not want y-cruncher to use. y-cruncher does not do this automatically because it "doesn't know which logical cores are the best to use".

I call this method "Thread Spamming". Yes, it sounds ridiculous. But it's a very simple and effective way to deal with load imbalance.
 
Q:  Is y-cruncher open-sourced?
A:  No.
 
Q:  Who are you? Are you really still in college? What degrees do you have? etc...
A:  About me.

Links:

Here's some interesting sites dedicated to the computation of Pi and other constants:

Special Thanks

Questions or Comments

Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.