y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

It Stands at 10 trillion digits of Pi...
World Record for both Desktop and Supercomputer!!!

 

(Last updated: April 15, 2013)

 

Shortcuts:

 

The first scalable multi-threaded Pi-benchmark for multi-core systems...

 

Against the Big Guns...

Faster than SuperPi on single-core...

Faster than PiFast 4.3 on dual-core...

Faster than QuickPi 4.5 on quad-core...

 

1 billion digits of Pi in 5 minutes on 6-core Core i7 @ 4.26 GHz

 

See the official XtremeSystems thread for more benchmarks.

 

Latest Alpha Release:

Windows: Version 0.6.1 Build 9282 (Released: February 17, 2013)

Linux      : This is gonna take a while. I'm graduating and relocating, so I'm pretty busy atm...

 

Stable Release:

Windows: Version 0.5.5 Build 9180 (fix 2) (Released: April 6, 2011)

Linux      : Version 0.5.5 Build 9187 (fix 2) (Released: February 20, 2011)

 

y-cruncher v0.5.5 has been tested up to 10 trillion digits. (Credit: Shigeru Kondo)

 

Due to algorithmic limitations, v0.5.x is not capable of going above 10 trillion digits. (The exact limit is unknown, but is somewhere between 10 - 40 trillion...)

This will be solved in v0.6.x with the addition of a new multiplication algorithm.

 

 

Update on v0.6.1: (December 22, 2012)

 

Version 0.6.1 is almost ready for release. I've already been floating around a few private builds and all is good so far.

All that's needed before I can release it publicly are the anti-cheating mechanisms. See the version history for the list of changes.

 

That said, v0.6.1 is not a complete release. Many of the intended features are still missing (most notably the swap modes). But since everything relevant to benchmarking is complete, I figure it's time to release it to break the 2-year drought of releases.

 

I'll start looking for an AMD machine with XOP instructions shortly after I release v0.6.1. I want to make sure all the existing code is solid before I move on to that.

 

 

Update on v0.6.1: (April 21, 2012)

 

No, y-cruncher isn't dead. Sure, it's been almost a year since the last release. See my blog for more details.

 

 

Records Set by y-cruncher:

World Record Size Computations

Date Announced Date Completed: Source: Who: Constant: Decimal Digits: Time: Computer:
February 9, 2012 February 9, 2012   Alexander Yee Square Root of 2 2,000,000,000,050

Compute:  110 hours
Verify:  119 hours

2 x Xeon X5482 @ 3.2 GHz - 64 GB
8 x 2 TB
Core i7 2600K @ 4.4 GHz - 16 GB
5 x 1 TB + 5 x 2 TB
October 17, 2011 October 16, 2011 Source Shigeru Kondo &
Alexander Yee
Pi 10,000,000,000,050

Compute:  371 days

Verify:  45 hours

2 x Intel Xeon X5680 @ 3.33 GHz
96 GB DDR3 @ 1066 MHz
24 x 2 TB
May 24, 2011 May 20, 2011   Shigeru Kondo Log(10) 100,000,000,000

Compute and Verify:
142 hours

Intel Core i7 2500K @ 4.8 GHz
16 GB DDR3 + 6 x 1 TB
May 14, 2011   Shigeru Kondo Log(2) 100,000,000,000

Compute and Verify:
142 hours

Intel Core i7 2500K @ 4.8 GHz
16 GB DDR3 + 6 x 1 TB
December 13, 2010 December 12, 2010   Alex Roberts Catalan's Constant 50,000,000,000

Compute:  35.5 days

Not Verified

AMD Phenom II X4 945 @ 3.0 GHz
8 GB
December 4, 2010 November 26, 2010   Alex Roberts Log(2) 50,000,000,000

Compute:  13.1 days

Independently Verified

Intel Core i7 720Q @ 1.60 GHz
4 GB DDR3
September 17, 2010 September 17, 2010 Source Alexander Yee Zeta(3) - Apery's Constant 100,000,001,000

Compute:  148 hours

Verify:  366 hours

"Nagisa" + "Ushio"
August 2, 2010 August 2, 2010 Source Shigeru Kondo &
Alexander Yee
Pi 5,000,000,000,000

Compute:  90 days

Verify:  64 hours

2 x Intel Xeon X5680 @ 3.33 GHz
96 GB DDR3 @ 1066 MHz
16 x 2 TB
July 8, 2010 July 8, 2010 Source Alexander Yee Golden Ratio 1,000,000,000,000

Compute:  114 hours

Verify:  ~7 days*

*Not a continuous run.

"Nagisa"
2 x Intel Xeon X5482 @ 3.2 GHz
64 GB DDR2 FB-DIMM
1.5 TB (Boot + Output)
4 x 1 TB (2 x 2 RAID0) + 6 x 2 TB
July 5, 2010 July 5, 2010 Source Shigeru Kondo e 1,000,000,000,000

Compute: 224 hours

Verify: 219 hours

Intel Core i7 980X @ 3.33 GHz
12 GB DDR3
2 TB (Boot + Output)
8 x 1 TB (Computation)
March 22, 2010 March 22, 2010 Source Shigeru Kondo Square Root of 2 1,000,000,000,000

Compute:  193 hours

Verify:  98 hours

Core i7 975 @ 4 GHz - 12GB
8 x 1 TB HDs
2 x Xeon W5590 - 144GB
16 x 2 TB HDs
February 21, 2010 February 20, 2010 Source Alexander Yee e 500,000,000,000

Compute and Verify:
307 hours

"Ushio"
April 16, 2009 April 16, 2009 Source Alexander Yee &
Raymond Chan
Catalan's Constant 31,026,000,000

Compute:  178 hours

Verify:  221 hours

"Nagisa"
March 13, 2009 March 13, 2009 Source Alexander Yee &
Raymond Chan
Euler-Mascheroni Constant 29,844,489,545

Compute:  205 hours

Verify:  269 hours

"Nagisa"
February 28, 2009 Source Alexander Yee &
Raymond Chan
Log(10) 31,026,000,000

Compute and Verify:
40 hours

"Nagisa"
February 15, 2009 Source Alexander Yee &
Raymond Chan
Zeta(3) - Apery's Constant 31,026,000,000

Compute:  45 hours

Verify:  44 hours

"Nagisa"
February 4, 2009 Source Alexander Yee &
Raymond Chan
Log(2) 31,026,000,000

Compute:  24 hours

Verify:  16 hours

"Nagisa"
Janurary 31, 2009 January 31, 2009 Source Alexander Yee &
Raymond Chan
Catalan's Constant 15,510,000,000

Compute:  88 hours

Verify:  100 hours

"Nagisa"
Janurary 21, 2009 January 21, 2009 Source Alexander Yee &
Raymond Chan
Zeta(3) - Apery's Constant 15,510,000,000

Compute:  20 hours

Verify:  21 hours

"Nagisa"
Janurary 18, 2009 January 18, 2009 Source Alexander Yee &
Raymond Chan
Euler-Mascheroni Constant 14,922,244,771

Compute:  96 hours

Verify:  134 hours

"Nagisa"
January 7, 2009 Source Alexander Yee &
Raymond Chan
Log(2) 15,500,000,000

Compute:  12.5 hours

Verify:  8.3 hours

"Nagisa"

Note that starting from v0.5.2, the computation limits of the program are no longer locked below the current world records. So barring any bugs, anyone with sufficient resources will be able to break these records.

 

Features:

Aside from computing π and other constants, y-cruncher is great for stress testing 64-bit systems with lots of ram.

Download:

Known Issues: (as of current release)

Version History:

Main Page: y-cruncher - Version History

 

Algorithms:

If you're interested in what formulas and algorithms y-cruncher uses:

Main Page: y-cruncher - Language and Algorithms

 



 

Note that version 0.6.1 is out! However, it is still a work-in-progress. Many of the features (such as the swap modes) are still missing.

 

If you wish to use those features, you'll need to go back to v0.5.5 for now.

Windows: y-cruncher v0.6.1.9282.zip (4.50 MB)

Windows: y-cruncher v0.5.5.9180 (fix 2).zip (5.83 MB)
Linux      : y-cruncher v0.5.5.9187 (fix 2).tar.gz (5.34 MB)
 

System Requirements (v0.6.1):

System Requirements (v0.5.5):
Click here for older versions.


Please do not link directly to the file downloads as there may be newer versions.
Just link to http://www.numberworld.org/y-cruncher/#Download instead. Thanks!

 

 

 

 

 

 

 

 

Performance:

y-cruncher is the first efficient and publicly available Pi-calculator that can sustain a near 100% cpu load on multi-core computers.
There are other multi-threaded Pi-programs that can achieve high cpu usage, but few of them can sustain it through an entire Pi computation.

 

Below is a typical CPU utilization graph of y-cruncher when computing 1 billion digits of Pi across 8 cores.

 

y-cruncher uses less memory than most other Pi-programs. It is also able to multi-thread WITHOUT significantly increasing memory usage.

 

Benchmarks:

Comparison Chart: (Last updated: February 5, 2011)

 

All times in seconds. All times include the time needed to convert the digits to decimal representation.

All benchmarks were done using the fastest binary with the fastest achieved settings for the system they were run on.

v0.5.3 and v0.5.4 are exactly the same speed. So results are directly comparable. v0.5.5 is faster on processors with AVX instructions.

Processor(s): Core 2 Q6600
2.4 GHz
Phenom II X4 940
3.5 GHz
1
Core i7 920
2.80 GHz
2
Core i7 920
4.2 GHz
3
Core i7 980X
4.48 GHz
4
Core i7 2600K
4.8 GHz
5
4 x Opteron
(Barcelona)
2.31 GHz
6
2 x Xeon X5482
(Harpertown)
3.2 GHz
2 x Xeon X5680
(Westmere-EP)
3.46 GHz
7
Cores/Threads: 4/4 4/4 4/8 4/8 6/12 4/8 16/16 8/8 12/24
Number of Digits v0.5.3 v0.5.3 v0.5.4 v0.5.4 v0.5.4 v0.5.5 v0.5.3 v0.5.5 v0.5.3
1,000,000 0.566   0.390 0.259   0.216   0.354  
10,000,000 5.286   3.667 2.466   1.916   3.147  
100,000,000 68.95 48.55 43.60 29.53 22.51 21.54 35.09 29.43 16.29
1,000,000,000 990.0 698.8 619.4 424.3 302.8 311.8 468.1 392.5 202.5
10,000,000,000               5,365 2,721

1Overclocked from 3.0 GHz. Credit to CRFX from XtremeSystems.

2Base frequency is 2.67 GHz. Intel Turbo Boost Technology increases actual operating frequency to 2.8 GHz.

3Overclocked from 2.67 GHz to 4.0 GHz. Actual operating frequency after Turbo Boost is 4.2 GHz.

4Overclocked from 3.33 GHz. Credit to tet5uo from XtremeSystems.

5Base frequency is 3.4 GHz with 3.5 GHz 4-core Turbo Boost. Actual operating frequency is 4.8 GHz by overclocking. Credit to Shigeru Kondo for the 100m and 1b runs. (I haven't had the time to play with my overclock enough to get 4.8 GHz benchable for longer runs.)

6Credit to skycrane from XtremeSystems.

7Base frequency is 3.33 GHz. Intel Turbo Boost Technology increases actual operating frequency to 3.46 GHz. Credit to Shigeru Kondo.

 

 

 

Random Screenshots: (from my test machines)

Pi - 1 billion digits (7 minutes) Pi - 1 billion digits (5 minutes, 33 seconds) Pi - 100 billion digits (28 hours)
Windows 7 - y-cruncher v0.5.4 x64 SSE4.1 Ubuntu Linux 10 - y-cruncher v0.5.5 x64 AVX Windows 7 - y-cruncher v0.5.4 x64 SSE4.1
Core i7 920 @ 4.2 GHz (overclock) Core i7 2600K @ 4.4 GHz (overclock) 2 x Xeon X5482 Harpertown @ 3.2 GHz
12 GB DDR3 @ 1200 MHz 16 GB DDR3 @ 1333 MHz 64 GB DDR2 FB-DIMM @ 800 MHz
    8 x 2 TB Hitachi Deskstar

 

Fastest Times:

(Last updated: November 18, 2012)

 

The full chart of rankings for each size can be found here:

*These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@u.northwestern.edu

FAQ:

Q:  Why does AVX (v0.5.5) only give about 10% speedup over SSE4.1 (v0.5.4)? Shouldn't it be double the speed?
A:  Unlike the majority of compute-intensive applications, y-cruncher does not exclusively use floating-point. As of v0.5.4, only about 30% of a Pi computation is floating-point bound. The remainder of the time is spent on integer operations and stalling on memory access. So cutting that 30% in half yields little overall speedup. Speeding up the code in this manner exposes more memory bottlenecks - which ends up reducing the speedup to only 10%...

Integer operations can be largely be emulated using floating-point (albeit with overhead). But most of the integer work involves carry-propagation, so it is not very vectorizable. For now, integer operations are still faster using the normal integer instructions.

Even without AVX, floating-point is not the dominant factor in performance.
Plans for v0.6.x include improving the integer operations to better utilize x64 capabilities. Memory optimizations are also slated for the future.

Mathematical improvements will be added whenever convenient. These will improve the computational speed of y-cruncher, but will probably have no effect on the resource consumption ratios in y-cruncher. (By "resource consumption ratios" I mean the relative portions of the program with are bound by integer/floating-point/memory.)

 

 

 

Q:  Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!
A:  The reason is simple. For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself!!! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done. As of 2010, I am not aware of any Pi-program that achieves perfect parallelism for small computations and is at least half the speed of y-cruncher. (It's easy to get perfect parallelism if you artificially make the task really slow.)

Now here's a layman explanation:

Suppose you had a 1000 page novel that needs to be proof-read. Now suppose you had 10 staff members. To speed up the process, you can assign 100 pages to each staff member. This basically makes the task 10x faster. This is an example of a "large computation" done using a "small" number of cores.

Now suppose you need to proof-read a 1 page paper. And you have a staff of 100 people. Are you going to tear the page into 100 small parts and give one to each person to proof-read? Furthermore, organizing your group of 100 staff members is probably going to take longer than proof reading the entire page yourself!!! This is an example of a "small computation" using a "large" number of cores.

Of course, this is a highly simplified explanation of what is really going on. But the overall idea is the same.
It should be noted that some tasks are easier to parallelize than others. Most of the multi-threaded benchmarks that are used in the overclocking community tend to be highly "synthetic" tasks that are extremely easy to parallelize. These "synthetic" tasks typically achieve perfect parallelism regardless of the size of the task. But most other applications that do any sort of "meaningful" task tend to be less ideal. (And no, I'm not trying to imply that computing Pi is in anyway meaningful.)

 

 

Q:  Who else helped you develop this program?
A:  Although all the development and testing (including all the coding) was done by myself, I have to give a lot of credit to my parents for providing me the resources to acquire the hardware that I've needed.

The total cost of all the hardware that were directly involved in the development of y-cruncher adds up to more than $10,000. Most of this came from my parents. So yes, y-cruncher was developed entirely using my own hardware. (With small amounts of testing on some of my friends' hardware.)

Since then, I've gained back much more than 10-grand in the form of scholarships and job opportunities. (So it can't really be quantified...)
But that's not the point. The y-cruncher project started out as just a really expensive hobby from which we were expecting no return.

It just happened to turn out a little better than expected...

 

Q:  Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?
A:  Yes there is, but it's still in a very alpha stage and is far from ready for release.

 

 
Q:  Is there a distributed version that performs better on NUMA and HPC clusters?
A:  Not yet. Pretty much the entire program needs to be redesigned for NUMA and MPI.

For now, a speedup can be gained in Linux by running y-cruncher with interleaved memory: numactl --interleave=all "./x64 SSE3.out"
 
 
Q:  Can you make a CUDA version?
A:  Definitely not for CUDA 1.x. Things are better for CUDA 2.0, but probably still not good enough.

Here are the major reasons:

  1. GPUs are highly vectorized. y-cruncher isn't ready for massive scalable vectorization. The widening of the SIMD vector from 128-bit SSE to 256-bit AVX only produced 10% speedup. Therefore, going to larger vectors with the current set of algorithms will be well into the region of diminishing return. New algorithms and programming techniques will need to be developed to make such a task efficient on GPUs.

     

  2. The memory bandwidth between the GPU and main memory will almost certainly be a bottleneck. This isn't a problem for small computations that will fit entirely into GPU memory, but small computations isn't the point of y-cruncher.

     

  3. There is currently no set-in-stone standard for GP-GPU programming. CUDA is NVidia only. Most other GPU programming standards are, for the most part, too graphics oriented for such a non-graphics task as computing Pi.
 
Q:  How does y-cruncher compare to other programs?
A:  Here's a graph comparing y-cruncher to the two fastest publicly-available Pi programs that I know of. (click to enlarge)

(Huge thanks to Fredrik Kjølstad for compiling GMP for me on my machines!)
Q:  Why does y-cruncher run 4 threads on my 3-core system (8 threads on 6-core, etc...)
A:  This is due to practical restrictions in the algorithms that are used by y-cruncher. Because of the nature of the algorithms that y-cruncher uses, they are most efficiently paralleled when the thread count is a power of two. To deal with systems that don't have a power-of-two number of logical cores, y-cruncher simply rounds up to the next power of two.

The overhead of running extra threads is usually very small. Any load balancing issues that result from awkward thread-to-core ratios are usually resolved by further increasing the thread count. (as explained in the next Q/A)
 
Q:  Why does y-cruncher create more threads than I tell it to use? Because of this I can't get dual-core benchmarks on a quad core machine since it will use all 4 cores even in dual-core mode.
A:  This is by design and is NOT a bug. Because of the nature of some of the algorithms, I find it necessary to spam threads in order to maximize multi-core efficiency. The work-around is to go to "Processor Affinity" in Task Manager and uncheck the cores that you do not want y-cruncher to use. y-cruncher does not do this automatically because it "doesn't know which logical cores are the best to use".

I call this method "Thread Spamming". Yes, it sounds ridiculous. But it's a very simple and effective way to deal with load imbalance.
 
Q:  Is y-cruncher open-sourced?
A:  No.
 
Q:  Who are you?
A:  About me.

Links:

Here's some interesting sites dedicated to the computation of Pi and other constants:

Special Thanks

Questions or Comments

Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.