y-cruncher - A Multi-Threaded Pi-Program
From a high-school project that went a little too far...
By Alexander J. Yee
(Last updated: April 21, 2012)
The first scalable multi-threaded Pi-benchmark for multi-core systems...
Against the Big Guns...
Faster than SuperPi on single-core...
Faster than PiFast 4.3 on dual-core...
Faster than QuickPi 4.5 on quad-core...
1 billion digits of Pi in 5 minutes on a 6-core Core i7 @ 4.26 GHz
See the official XtremeSystems thread for more benchmarks.
Latest Version:
Windows: Version 0.5.5 Build 9180 (fix 2) (Released: April 6, 2011)
Linux: Version 0.5.5 Build 9187 (fix 2) (Released: February 20, 2011)
Starting from v0.5.2, y-cruncher allows Pi computations of up to 10 trillion digits.
y-cruncher has been tested up to the current limit of 10 trillion digits. (Credit: Shigeru Kondo)
Due to algorithmic limitations, v0.5.x is not capable of going above 10 trillion digits. (The exact limit is unknown, but it is somewhere between 10 and 40 trillion...)
This will be solved in v0.6.1 with the addition of a new multiplication algorithm.
Update on v0.6.1: (April 21, 2012)
No, y-cruncher isn't dead. Sure, it's been almost a year since the last release. See my blog for more details.
World Record Size Computations
| Date Announced | Date Completed | Source | Who | Constant | Decimal Digits | Time | Computer |
| February 9, 2012 | February 9, 2012 | Alexander Yee | Square Root of 2 | 2,000,000,000,050 | 2 x Xeon X5482 @ 3.2 GHz - 64 GB 8 x 2 TB Core i7 2600K @ 4.4 GHz - 16 GB 5 x 1 TB + 5 x 2 TB |
||
| October 17, 2011 | October 16, 2011 | Source | Shigeru Kondo & Alexander Yee |
Pi | 10,000,000,000,050 | 2 x Intel Xeon X5680 @ 3.33 GHz 96 GB DDR3 @ 1066 MHz 24 x 2 TB |
|
| May 24, 2011 | May 20, 2011 | Shigeru Kondo | Log(10) | 100,000,000,000 | Compute and Verify: |
Intel Core i7 2500K @ 4.8 GHz 16 GB DDR3 + 6 x 1 TB |
|
| May 14, 2011 | Shigeru Kondo | Log(2) | 100,000,000,000 | Compute and Verify: |
Intel Core i7 2500K @ 4.8 GHz 16 GB DDR3 + 6 x 1 TB |
||
| December 13, 2010 | December 12, 2010 | Alex Roberts | Catalan's Constant | 50,000,000,000 | Compute: 35.5 days Not Verified |
AMD Phenom II X4 945 @ 3.0 GHz 8 GB |
|
| December 4, 2010 | November 26, 2010 | Alex Roberts | Log(2) | 50,000,000,000 | Compute: 13.1 days Independently Verified |
Intel Core i7 720Q @ 1.60 GHz 4 GB DDR3 |
|
| September 17, 2010 | September 17, 2010 | Source | Alexander Yee | Zeta(3) - Apery's Constant | 100,000,001,000 | "Nagisa" + "Ushio" | |
| August 2, 2010 | August 2, 2010 | Source | Shigeru Kondo & Alexander Yee |
Pi | 5,000,000,000,000 | 2 x Intel Xeon X5680 @ 3.33 GHz 96 GB DDR3 @ 1066 MHz 16 x 2 TB |
|
| July 8, 2010 | July 8, 2010 | Source | Alexander Yee | Golden Ratio | 1,000,000,000,000 |
*Not a continuous run. |
"Nagisa" 2 x Intel Xeon X5482 @ 3.2 GHz 64 GB DDR2 FB-DIMM 1.5 TB (Boot + Output) 4 x 1 TB (2 x 2 RAID0) + 6 x 2 TB |
| July 5, 2010 | July 5, 2010 | Source | Shigeru Kondo | e | 1,000,000,000,000 | Intel Core i7 980X @ 3.33 GHz 12 GB DDR3 2 TB (Boot + Output) 8 x 1 TB (Computation) |
|
| March 22, 2010 | March 22, 2010 | Source | Shigeru Kondo | Square Root of 2 | 1,000,000,000,000 | Core i7 975 @ 4 GHz - 12GB 8 x 1 TB HDs 2 x Xeon W5590 - 144GB 16 x 2 TB HDs |
|
| February 21, 2010 | February 20, 2010 | Source | Alexander Yee | e | 500,000,000,000 | Compute and Verify: |
"Ushio" |
| April 16, 2009 | April 16, 2009 | Source | Alexander Yee & Raymond Chan |
Catalan's Constant | 31,026,000,000 | Compute: 178 hours Verify: 221 hours |
"Nagisa" |
| March 13, 2009 | March 13, 2009 | Source | Alexander Yee & Raymond Chan |
Euler-Mascheroni Constant | 29,844,489,545 | Compute: 205 hours Verify: 269 hours |
"Nagisa" |
| February 28, 2009 | Source | Alexander Yee & Raymond Chan |
Log(10) | 31,026,000,000 | Compute and Verify: |
"Nagisa" | |
| February 15, 2009 | Source | Alexander Yee & Raymond Chan |
Zeta(3) - Apery's Constant | 31,026,000,000 | Compute: 45 hours Verify: 44 hours |
"Nagisa" | |
| February 4, 2009 | Source | Alexander Yee & Raymond Chan |
Log(2) | 31,026,000,000 | Compute: 24 hours Verify: 16 hours |
"Nagisa" | |
| January 31, 2009 | January 31, 2009 | Source | Alexander Yee & Raymond Chan |
Catalan's Constant | 15,510,000,000 | Compute: 88 hours Verify: 100 hours |
"Nagisa" |
| January 21, 2009 | January 21, 2009 | Source | Alexander Yee & Raymond Chan |
Zeta(3) - Apery's Constant | 15,510,000,000 | Compute: 20 hours Verify: 21 hours |
"Nagisa" |
| January 18, 2009 | January 18, 2009 | Source | Alexander Yee & Raymond Chan |
Euler-Mascheroni Constant | 14,922,244,771 | Compute: 96 hours Verify: 134 hours |
"Nagisa" |
| January 7, 2009 | Source | Alexander Yee & Raymond Chan |
Log(2) | 15,500,000,000 | Compute: 12.5 hours Verify: 8.3 hours |
"Nagisa" |
Note that starting from v0.5.2, the computation limits of the program are no longer locked below the current world records. So barring any bugs, anyone with sufficient resources will be able to break these records.
Aside from computing π and other constants, y-cruncher is great for stress testing 64-bit systems with lots of RAM.
Known Issues: (as of current release)
Main Page: y-cruncher - Version History
If you're interested in what formulas and algorithms y-cruncher uses:
Main Page: y-cruncher - Language and Algorithms
y-cruncher is the first efficient and publicly available Pi-calculator that can sustain a near 100% CPU load on multi-core computers.
There are other multi-threaded Pi-programs that can achieve high CPU usage, but few of them can sustain it through an entire Pi computation.
Below is a typical CPU utilization graph of y-cruncher when computing 1 billion digits of Pi across 8 cores.
y-cruncher uses less memory than most other Pi-programs. It is also able to multi-thread WITHOUT significantly increasing memory usage.
Comparison Chart: (Last updated: February 5, 2011)
All times in seconds. All times include the time needed to convert the digits to decimal representation.
All benchmarks were done using the fastest binary with the fastest achieved settings for the system they were run on.
v0.5.3 and v0.5.4 are exactly the same speed. So results are directly comparable. v0.5.5 is faster on processors with AVX instructions.
| Processor(s): | Core 2 Q6600 2.4 GHz | Phenom II X4 940 3.5 GHz¹ | Core i7 920 2.80 GHz² | Core i7 920 4.2 GHz³ | Core i7 980X 4.48 GHz⁴ | Core i7 2600K 4.8 GHz⁵ | 4 x Opteron (Barcelona) 2.31 GHz⁶ | 2 x Xeon X5482 (Harpertown) 3.2 GHz | 2 x Xeon X5680 (Westmere-EP) 3.46 GHz⁷ |
| Cores/Threads: | 4/4 | 4/4 | 4/8 | 4/8 | 6/12 | 4/8 | 16/16 | 8/8 | 12/24 |
| Number of Digits | v0.5.3 | v0.5.3 | v0.5.4 | v0.5.4 | v0.5.4 | v0.5.5 | v0.5.3 | v0.5.5 | v0.5.3 |
| 1,000,000 | 0.566 | 0.390 | 0.259 | 0.216 | 0.354 | ||||
| 10,000,000 | 5.286 | 3.667 | 2.466 | 1.916 | 3.147 | ||||
| 100,000,000 | 68.95 | 48.55 | 43.60 | 29.53 | 22.51 | 21.54 | 35.09 | 29.43 | 16.29 |
| 1,000,000,000 | 990.0 | 698.8 | 619.4 | 424.3 | 302.8 | 311.8 | 468.1 | 392.5 | 202.5 |
| 10,000,000,000 | 5,365 | 2,721 |
¹ Overclocked from 3.0 GHz. Credit to CRFX from XtremeSystems.
² Base frequency is 2.67 GHz. Intel Turbo Boost Technology increases actual operating frequency to 2.8 GHz.
³ Overclocked from 2.67 GHz to 4.0 GHz. Actual operating frequency after Turbo Boost is 4.2 GHz.
⁴ Overclocked from 3.33 GHz. Credit to tet5uo from XtremeSystems.
⁵ Base frequency is 3.4 GHz with 3.5 GHz 4-core Turbo Boost. Actual operating frequency is 4.8 GHz by overclocking. Credit to Shigeru Kondo for the 100m and 1b runs. (I haven't had the time to play with my overclock enough to get 4.8 GHz benchable for longer runs.)
⁶ Credit to skycrane from XtremeSystems.
⁷ Base frequency is 3.33 GHz. Intel Turbo Boost Technology increases actual operating frequency to 3.46 GHz. Credit to Shigeru Kondo.
Random Screenshots: (from my test machines)
(Last updated: September 22, 2011)
All times in seconds.
Green indicates that the benchmark has been validated.
Red indicates that the benchmark was either not validated, or no validation was provided.
In the future, I may decide to allow only validated benchmarks on this list.
As of the current release, only RAM-only Pi computations done using the Benchmark feature will be validated. However, starting from version 0.5.2, all computations have validation. This includes both swap modes as well as all the other constants.
A full chart of rankings for each size can be found here:
| Desktop (Limit One Processor) | ||||||
| Digits | Time | Version | | Computer | | Credit |
| 25,000,000 | 3.606 | v0.5.5 | x64 AVX | Intel Core i7 3930K @ 4.95 GHz | 16 GB DDR3 | CARDB0ARDfoxx @ XtremeSystems |
| 50,000,000 | 6.888 | v0.5.5 | x64 AVX | Intel Core i7 3930K @ 4.95 GHz | 16 GB DDR3 | CARDB0ARDfoxx @ XtremeSystems |
| 100,000,000 | 14.895 | v0.5.5 | x64 AVX | Intel Core i7 3930K @ 4.95 GHz | 16 GB DDR3 | CARDB0ARDfoxx @ XtremeSystems |
| 250,000,000 | 42.435 | v0.5.5 | x64 AVX | Intel Core i7 3930K @ 4.95 GHz | 16 GB DDR3 | CARDB0ARDfoxx @ XtremeSystems |
| 500,000,000 | 105.520 | v0.5.5 | x64 AVX | Intel Core i7 3930K @ 4.20 GHz | 32 GB DDR3 | Rod Laird |
| 1,000,000,000 | 233.251 | v0.5.5 | x64 AVX | Intel Core i7 3930K @ 4.20 GHz | 32 GB DDR3 | Rod Laird |
| 2,500,000,000 | 628.925 | v0.5.5 | x64 AVX | Intel Core i7 3930K @ 4.20 GHz | 32 GB DDR3 | Rod Laird |
| 5,000,000,000 | 1,369.20 | v0.5.5 | x64 AVX | Intel Core i7 3930K @ 4.20 GHz | 32 GB DDR3 | Rod Laird |
| 10,000,000,000 | 2,802.06 | v0.5.5 | x64 AVX | Intel Core i7 3930K @ 4.60 GHz | 64 GB DDR3 | Duane Lyons |
| 25,000,000,000 | 6.874 hours | v0.5.4 | x64 SSE4.1 | Intel Core i7 2600K @ 4.50 GHz - on Water, Hard Drives: 1.5 TB + 4 x 1 TB | 16 GB DDR3 | Alexander Yee |
| 50,000,000,000 | 22.343 hours | v0.5.2 | x64 SSE4.1 | Intel Core i7 920 @ 3.34 GHz (3.5 GHz Turbo Boost) - on Air, Hard Drives: 4 x 2 TB | 12 GB DDR3 | Alexander Yee |
| 100,000,000,000 | 30.984 hours | v0.5.5 | x64 AVX | Intel Core i7 2600K @ 4.40 GHz - on Water, Hard Drives: 1.5 TB + 5 x 1 TB + 5 x 2 TB | 16 GB DDR3 | Alexander Yee |
| 250,000,000,000 | 123.982 hours | v0.5.5 | x64 AVX | Intel Core i7 2600K @ 4.40 GHz - on Water, Hard Drives: 1.5 TB + 4 x 1 TB | 16 GB DDR3 | Alexander Yee |
| 500,000,000,000 | - | - | - | - | - | - |
| Any Computer (No Processor Limit) | ||||||
| Digits | Time | Version | | Computer | | Credit |
| 25,000,000 | 3.849 | v0.5.3 | x64 SSE4.1 | 2 x Intel Xeon X5680 @ 4.3 GHz | 12 GB DDR3 | sRHunt3r @ XtremeSystems |
| 50,000,000 | 7.585 | v0.5.3 | x64 SSE4.1 | 2 x Intel Xeon X5680 @ 4.3 GHz | 12 GB DDR3 | sRHunt3r @ XtremeSystems |
| 100,000,000 | 14.512 | v0.5.3 | x64 SSE4.1 | 2 x Intel Xeon X5680 @ 4.3 GHz | 12 GB DDR3 | sRHunt3r @ XtremeSystems |
| 250,000,000 | 38.582 | v0.5.3 | x64 SSE4.1 | 2 x Intel Xeon X5680 @ 4.3 GHz | 12 GB DDR3 | sRHunt3r @ XtremeSystems |
| 500,000,000 | 79.311 | v0.4.4 | x64 SSE4.1 | 2 x Intel Xeon X5680 @ 4.3 GHz | 12 GB DDR3 | sRHunt3r @ XtremeSystems |
| 1,000,000,000 | 174.470 | v0.4.4 | x64 SSE4.1 | 2 x Intel Xeon X5680 @ 4.3 GHz | 12 GB DDR3 | sRHunt3r @ XtremeSystems |
| 2,500,000,000 | 552.673 | v0.5.2 | x64 SSE4.1 | 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) | 128 GB DDR3 | Daniel Ghidali |
| 5,000,000,000 | 1,143.750 | v0.5.2 | x64 SSE4.1 | 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) | 128 GB DDR3 | Daniel Ghidali |
| 10,000,000,000 | 2,121.99 | v0.5.5 | x64 SSE3 | 4 x AMD Opteron 6168 | 64 GB DDR3 | Sheik @ XtremeSystems |
| 25,000,000,000 | 1.720 hours | v0.5.3 | x64 SSE4.1 | 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) | 128 GB DDR3 | Daniel Ghidali |
| 50,000,000,000 | 14.807 hours | v0.5.3 | x64 SSE4.1 | 2 x Intel Xeon X5482 @ 3.2 GHz, Hard Drives: 1.5 TB + 4 x 1 TB | 64 GB DDR2 | Alexander Yee |
| 100,000,000,000 | 17.283 hours | v0.5.3 | x64 SSE4.1 | 2 x Intel Xeon X5650 @ 2.66 GHz, Hard Drives: 16 x 2 TB | 144 GB DDR3 | Shigeru Kondo |
| 250,000,000,000 | 83.586 hours | v0.5.2 | x64 SSE4.1 | 2 x Intel Xeon W5590 @ 3.33 GHz, Hard Drives: 8 x 2 TB | 144 GB DDR3 | Shigeru Kondo |
| 500,000,000,000 | 172.396 hours | v0.5.2 | x64 SSE4.1 | 2 x Intel Xeon W5590 @ 3.33 GHz, Hard Drives: 16 x 2 TB | 144 GB DDR3 | Shigeru Kondo |
| 1,000,000,000,000 | 12.260 days | v0.5.3 | x64 SSE4.1 | 2 x Intel Xeon X5650 @ 2.66 GHz, Hard Drives: 16 x 2 TB | 96 GB DDR3 | Shigeru Kondo |
| 2,500,000,000,000 | ||||||
| 5,000,000,000,000 | 90 days | v0.5.4 | x64 SSE4.1 | 2 x Intel Xeon X5680 @ 3.33 GHz, Hard Drives: 16 x 2 TB | 96 GB DDR3 | Shigeru Kondo |
| 10,000,000,000,000 | 371 days | v0.5.5 | x64 SSE4.1 | 2 x Intel Xeon X5680 @ 3.33 GHz, Hard Drives: 24 x 2 TB | 96 GB DDR3 | Shigeru Kondo |
*These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@u.northwestern.edu
Q: Why does AVX (v0.5.5) only give about 10% speedup over SSE4.1 (v0.5.4)? Shouldn't it be double the speed?
A: Unlike the majority of compute-intensive applications, y-cruncher does not exclusively use floating-point. As of v0.5.4, only about 30% of a Pi computation is floating-point bound. The remainder of the time is spent on integer operations and stalling on memory access. So cutting that 30% in half yields little overall speedup. Speeding up the code in this manner exposes more memory bottlenecks - which ends up reducing the speedup to only 10%...
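The arithmetic behind this can be sketched with Amdahl's law. This is only an illustration: the 30% figure is the one quoted above, the assumption that AVX exactly doubles floating-point throughput is an idealization, and `amdahl_speedup` is a hypothetical helper name, not anything from y-cruncher.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the runtime becomes `factor` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    # The untouched part keeps its cost; the accelerated part shrinks by `factor`.
    return 1.0 / ((1.0 - fraction) + fraction / factor)


fp_fraction = 0.30  # share of a Pi computation that is floating-point bound (from the text)
avx_factor = 2.0    # ASSUMPTION: AVX ideally doubles floating-point throughput

best_case = amdahl_speedup(fp_fraction, avx_factor)
print(f"Best-case overall speedup: {best_case:.3f}x")  # prints 1.176x
```

Under these assumptions the ceiling is only about 1.18x, even before any extra memory stalls are counted, so an observed gain of roughly 10% is not surprising.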
Integer operations can largely be emulated using floating-point (albeit with overhead). But most of the integer work involves carry-propagation, so it is not very vectorizable. For now, integer operations are still faster using the normal integer instructions.
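To see why carry-propagation resists vectorization, here is a minimal sketch of schoolbook multi-word addition. It is not y-cruncher's actual code; `BASE`, `add_limbs`, and the 64-bit word size are illustrative assumptions.

```python
BASE = 2**64  # ASSUMPTION: one "limb" is one 64-bit word


def add_limbs(a, b):
    """Add two equal-length little-endian limb lists.

    Returns (sum_limbs, carry_out).
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of limbs")
    out = []
    carry = 0
    for x, y in zip(a, b):
        # Each step needs the carry produced by the previous step,
        # so the limbs cannot simply be processed independently.
        total = x + y + carry
        out.append(total % BASE)
        carry = total // BASE
    return out, carry


# Worst case: adding 1 makes a carry ripple through every limb.
a = [BASE - 1] * 4
b = [1, 0, 0, 0]
result, carry = add_limbs(a, b)
print(result, carry)  # [0, 0, 0, 0] 1
```

The loop-carried dependency on `carry` is the point: a SIMD unit can add many limbs at once, but it cannot know each limb's incoming carry until the limb before it has finished.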
Even without AVX, floating-point is not the dominant factor in performance.
Plans for v0.6.x include improving the integer operations to better utilize x64 capabilities. Memory optimizations are also slated for the future.
Mathematical improvements will be added whenever convenient. These will improve the computational speed of y-cruncher, but will probably have no effect on the resource consumption ratios in y-cruncher. (By "resource consumption ratios" I mean the relative portions of the program which are bound by integer/floating-point/memory.)
Q: Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!
A: The reason is simple. For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself!!! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done. As of 2010, I am not aware of any Pi-program that achieves perfect parallelism for small computations and is at least half the speed of y-cruncher. (It's easy to get perfect parallelism if you artificially make the task really slow.)
Now here's a layman's explanation:
Suppose you had a 1000 page novel that needs to be proof-read. Now suppose you had 10 staff members. To speed up the process, you can assign 100 pages to each staff member. This basically makes the task 10x faster. This is an example of a "large computation" done using a "small" number of cores.
Now suppose you need to proof-read a 1 page paper. And you have a staff of 100 people. Are you going to tear the page into 100 small parts and give one to each person to proof-read? Furthermore, organizing your group of 100 staff members is probably going to take longer than proof-reading the entire page yourself!!! This is an example of a "small computation" using a "large" number of cores.
Of course, this is a highly simplified explanation of what is really going on. But the overall idea is the same.
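The proof-reading analogy can be turned into a toy cost model. Everything here is an assumption made for illustration: the work is taken to be perfectly divisible, each extra worker adds a fixed organizing cost, and the names `estimated_time` and `best_thread_count` are hypothetical. It says nothing about how y-cruncher actually picks its thread count.

```python
def estimated_time(work: float, threads: int, spawn_cost: float) -> float:
    """Toy model: evenly split work plus a fixed cost for every thread started."""
    return work / threads + spawn_cost * threads


def best_thread_count(work: float, max_threads: int, spawn_cost: float) -> int:
    """Return the thread count (1..max_threads) with the lowest estimated time."""
    return min(
        range(1, max_threads + 1),
        key=lambda n: estimated_time(work, n, spawn_cost),
    )


# "Large computation": 1000 pages, 10 staff, 1 unit of overhead per person.
large = best_thread_count(work=1000.0, max_threads=10, spawn_cost=1.0)

# "Small computation": 1 page, 100 staff, same overhead per person.
small = best_thread_count(work=1.0, max_threads=100, spawn_cost=1.0)

print(large, small)  # 10 1
```

In this model the large job uses everyone available, while the small job is fastest when one person simply does it alone.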
It should be noted that some tasks are easier to parallelize than others. Most of the multi-threaded benchmarks that are used in the overclocking community tend to be highly "synthetic" tasks that are extremely easy to parallelize. These "synthetic" tasks typically achieve perfect parallelism regardless of the size of the task. But most other applications that do any sort of "meaningful" task tend to be less ideal. (And no, I'm not trying to imply that computing Pi is in any way meaningful.)
Q: Who else helped you develop this program?
A: Although all the development and testing (including all the coding) was done by myself, I have to give a lot of credit to my parents for providing me the resources to acquire the hardware that I've needed.
The total cost of all the hardware that was directly involved in the development of y-cruncher adds up to more than $10,000. Most of this came from my parents. So yes, y-cruncher was developed entirely using my own hardware. (With small amounts of testing on some of my friends' hardware.)
Since then, I've gained back much more than 10-grand in the form of scholarships and job opportunities. (So it can't really be quantified...)
But that's not the point. The y-cruncher project started out as just a really expensive hobby from which we were expecting no return.
It just happened to turn out a little better than expected...
For those who are more interested in the internals than in Pi, here are graphs comparing y-cruncher's multiplication to GMP and TachusPi. The data for these graphs were taken from the "Final Multiply" step of the Pi computations that were done above.
Here are some interesting sites dedicated to the computation of Pi and other constants:
Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.