y-cruncher - Frequently Asked Questions

(Last updated: September 24, 2017)

 

 


 

Pi and other Constants:

 

Where can I download the digits for Pi and other constants that are featured in the various record computations?

 

There is currently no reliable place to get the digits. Due to the large sizes of the data, it simply isn't feasible to move them around.

 

Personally, I have an archive of some of the digits, including all 22.4 trillion digits of Pi that have been computed. But because I'm on a consumer internet connection with rate limits and possibly bandwidth caps, it simply isn't possible for me to upload them. When I was still in school, I was able to seed some torrents since university connections are very fast. But that isn't possible anymore.

 

To answer the question directly, your best bet at getting the digits is to:

  1. Compute them locally using y-cruncher if you have the resources to do so.
  2. Contact the person who ran the computation and see if they still have the digits. (In all seriousness, this probably won't work since everyone seems to just delete the digits after the computation since they take too much space.)

Under extreme circumstances (if you're a famous professor trying to do research), I may make special arrangements to run research code locally on my machine. But this has happened only once, and I was dealing with some pretty amazing professors whom I didn't want to let down.

 

 

It's worth noting that BitTorrent is a viable way to distribute large files like the digits of Pi. And this is something I actually did for about 2 years while I was in school. But I can't really do that anymore because US-based consumer ISPs suck and they will come after you if you use too much bandwidth.

 


Can you search for XXX in the digits of Pi for me?

 

In short no. Not because I don't want to, but because I can't. 22.4 trillion digits is a lot of digits. It takes several days just to read all of it. So searching requires too much time and computational resources. I'm not Google and I don't have the ability to build a search index for something that large.

 

 


Can you add support for more constants? I want to compute ePi, Khinchin, Glaisher, etc...

 

In short no. The goal of y-cruncher is not to be the jack of all trades, but to focus on a small number of major constants and take them to the extreme.

 

Because of this emphasis on specialization, adding a new constant is not as simple as plugging in a new formula and pushing a button. In other words, there are technical and practical barriers to adding support for arbitrary constants or functions.

 

From the technical perspective, all constants need to be computable to N digits in quasi-linear time and linear memory. This automatically rules out a large number of the requests that I get for new constants. Hard-to-compute constants like Khinchin's Constant will never be computable to billions of digits unless someone discovers a suitable algorithm for it. For stuff like this, the y-cruncher project is the wrong tool for the job since it's specialized for billions and trillions of digits.

 

From the practical side, the issue is mostly a matter of implementation and maintenance costs. The more stuff there is, the more there is to maintain. Furthermore, most of the constants that are currently supported by y-cruncher are computable using a very small number of specially optimized subroutines. Anything that needs more than that will be a lot of work. This is why the Lemniscate constant uses the ArcSinlemn formulas instead of the AGM. The AGM is faster, but y-cruncher has no support for a fully generic square root function, and the Lemniscate constant is "not important enough" to justify adding such support.

 

As far as "plugging in formulas" goes, the easiest way is to use something like Mathematica since it's literally built for this purpose. If you need more performance, or if you need to reach sizes that are larger than what Mathematica can handle, you can try out the NumberFactory/YMP project. It's a partially open-sourced C++ project that exposes y-cruncher's parallelized bignum arithmetic. But of course, you'll need experience in C++ to be able to use it.

 

 

 

 

Hardware and Overclocking:

 

My computer is completely stable. But I keep getting errors such as, "Redundancy Check Failed: Coefficient is too large."

 

The notorious "Coefficient is too large" error is a common error that can be caused by many things. A full technical explanation is given in the Programming section below.

 

Because of the nature of the error, it can be caused by literally anything. But below are the two most common causes.

 

 

The hardware is not "AVX-stable":

 

If your "stable" overclock is hitting the "Coefficient is too large" error on the standard benchmark sizes, then your overclock is not as stable as you think it is.

 

This error is most commonly seen on Haswell processors that are overclocked. y-cruncher makes heavy use of AVX instructions if the processor supports them. So in order to successfully run a y-cruncher benchmark, your system needs to be AVX-stable. The problem is that the vast majority of programs don't use AVX. So many "rock-stable" overclocks are actually not stable when running AVX.

 

If you search around overclocking forums, you'll find that Haswell processors are notorious for generating a tremendous amount of heat when running AVX workloads like the latest Prime95. For that reason, many overclockers will skip these stress-tests, calling them "unrealistic". While this allows for significantly higher overclocks, it sacrifices stability for AVX-optimized applications. So it's common for overclocked systems to be perfectly stable for months, and then immediately crash and burn when attempting to run y-cruncher or the latest Prime95.

 

If you fall into this category, lower your overclock. While this is most commonly seen on Haswell, it has also been observed on later Intel processors. Intel processors starting from Kaby Lake and Haswell-E run AVX at lower frequencies to counter the extra stress that AVX generates.

 

While y-cruncher isn't quite as stressful as the latest Prime95, the workload is very similar. So if you cannot pass Prime95 small FFTs (with AVX) for at least a few seconds, you stand no chance of running any large benchmark or computation with y-cruncher.
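
To give a feel for what "AVX-stable" means, here is a minimal sketch of the kind of FMA-heavy inner loop that dominates these workloads. This is only an illustration - it is not y-cruncher's or Prime95's actual code - but loops like this draw far more power and heat than ordinary scalar code, which is why "stable" overclocks often fail here.

    // Illustration only - not y-cruncher's or Prime95's actual code.
    // A tight AVX2/FMA loop keeps the 256-bit vector units busy every cycle.
    // Compile with something like: g++ -O2 -mavx2 -mfma avx_loop.cpp
    #include <immintrin.h>
    #include <cstdio>

    int main(){
        __m256d a0 = _mm256_set1_pd(1.0000001);
        __m256d a1 = _mm256_set1_pd(0.9999999);
        __m256d a2 = _mm256_set1_pd(1.0000002);
        __m256d a3 = _mm256_set1_pd(0.9999998);
        const __m256d x = _mm256_set1_pd(1.0000000001);
        const __m256d y = _mm256_set1_pd(-1.0e-10);

        //  Several independent accumulators keep the FMA units saturated.
        for (long long i = 0; i < 2000000000LL; i++){
            a0 = _mm256_fmadd_pd(a0, x, y);
            a1 = _mm256_fmadd_pd(a1, x, y);
            a2 = _mm256_fmadd_pd(a2, x, y);
            a3 = _mm256_fmadd_pd(a3, x, y);
        }

        //  Print something so the compiler can't optimize the loop away.
        double out[4];
        _mm256_storeu_pd(out, _mm256_add_pd(_mm256_add_pd(a0, a1), _mm256_add_pd(a2, a3)));
        std::printf("%f\n", out[0]);
        return 0;
    }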

 

 

Memory Instability:

 

The other common cause is memory instability. y-cruncher has a reputation of being notoriously stressful on the memory subsystem. If you read around overclocking forums, there are countless reports of y-cruncher being able to uncover subtle memory instabilities where all other applications and stress-tests will pass.

 

This doesn't just happen with overclocking. It has been observed on server hardware as well. To date, I've had 4 bug reports on server hardware, and in all 4 of those cases the users reported that y-cruncher was the only application that managed to fail. Personally, I've had one error on server hardware that was believed to be caused by an overheating northbridge.

 

Naturally, one would do a double take since server hardware has ECC memory. But ECC only protects against bit flips in the memory. It doesn't guard against instability in the memory controller or general memory incompatibilities.

 

It's worth noting that y-cruncher was never designed to be a memory stress-test. And to date, it is still unclear why it often seems to be better at testing memory than even dedicated tests such as MemTest64.

 

 


Why is y-cruncher so much slower on AMD processors than Intel processors?

 

y-cruncher is more than 2x slower on AMD Bulldozer processors compared to Intel Haswell and Skylake. And AMD's Zen processor is still about 50% slower than Intel's 8-core Haswell and Broadwell HEDT processors. What's going on here? Is y-cruncher another one of those "Intel-biased" benchmarks?

 

 

Short Answer:

 

It boils down to raw SIMD throughput. Intel processors have 256-bit and 512-bit wide vector units, while AMD's vector units are only 128 bits wide.

y-cruncher is one of the few applications that can utilize AVX to get the full benefit of these wide vector units. This gives Intel processors a very large advantage.

 

 

Long Answer:

 

Here's a data dump of the performance throughput of various vector operations. Higher is better.

For unreleased processors, the entries are educated guesses based on publicly released information at the time of writing.

Hardware Throughput Per Cycle
(FP = floating-point, Int = integer.)

Processor              | Year | Unit     | Vector Width | Vector Units | FP Add  | FP Mul  | FP Add or Mul | FMA     | Int Add | Int Mul | Int Logic | Shuffle
Intel Sandy Bridge     | 2011 | 1 core   | 256-bit      | 3            | 1 x 256 | 1 x 256 | 2 x 256       | -       | 2 x 128 | 1 x 128 | 3 x 128   | 2 x 128 or 1 x 256
Intel Ivy Bridge       | 2012 | (same as Sandy Bridge)
Intel Haswell          | 2013 | 1 core   | 256-bit      | 3            | 1 x 256 | 2 x 256 | 2 x 256       | 2 x 256 | 2 x 256 | 1 x 256 | 3 x 256   | 1 x 256
Intel Broadwell        | 2014 | (same as Haswell)
Intel Skylake          | 2015 | 1 core   | 256-bit      | 3            | 2 x 256 | 2 x 256 | 2 x 256       | 2 x 256 | 3 x 256 | 2 x 256 | 3 x 256   | 1 x 256
Intel Kaby Lake        | 2017 | (same as Skylake)
Intel Knights Landing  | 2016 | 1 core   | 512-bit      | 2            | 2 x 512 | 2 x 512 | 2 x 512       | 2 x 512 | 2 x 512 | 2 x 512 | 2 x 512   | 1 x 512
Intel Skylake Purley   | 2017 | 1 core   | 256-bit      | 3            | 2 x 256 | 2 x 256 | 2 x 256       | 2 x 256 | 3 x 256 | 2 x 256 | 3 x 256   | 1 x 256
                       |      | 1 core   | 512-bit      | 2 (1 x FMA)  | 1 x 512 | 1 x 512 | 1 x 512       | 1 x 512 | 2 x 512 | 1 x 512 | 2 x 512   | 1 x 512
                       |      | 1 core   | 512-bit      | 2 (2 x FMA)  | 2 x 512 | 2 x 512 | 2 x 512       | 2 x 512 | 2 x 512 | 2 x 512 | 2 x 512   | 1 x 512
Intel Cannonlake       | 2017 | 1 core   | 512-bit      | 2 or 3*      | 2 x 512 | 2 x 512 | 2 x 512       | 2 x 512 | 2 x 512 | 2 x 512 | 2 x 512   | 1 x 512
AMD Bulldozer          | 2011 | 1 module | 128-bit      | 2            | 2 x 128 | 2 x 128 | 2 x 128       | 2 x 128 | 2 x 128 | 1 x 128 | 2 x 128   | 1 x 128
AMD Piledriver         | 2012 | (same as Bulldozer)
AMD Steamroller        | 2014 | (same as Bulldozer)
AMD Excavator          | 2015 | (same as Bulldozer)
AMD Zen                | 2017 | 1 core   | 128-bit      | 4            | 2 x 128 | 2 x 128 | 4 x 128       | 2 x 128 | 2 x 128 | 1 x 128 | 4 x 128   | 2 x 128

There really isn't much that needs to be said. Intel chips currently have much better SIMD throughput.

 

*It's likely that Cannonlake may also be split into half-throughput and full-throughput models.
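
To make the gap concrete, here is the peak double-precision throughput implied by the FMA column above, counting each fused multiply-add as two floating-point operations. (These are theoretical per-cycle peaks derived from the table, not measured y-cruncher numbers.)

    \[ \text{Haswell/Skylake (per core):} \quad 2 \times \tfrac{256}{64} \times 2 = 16 \ \text{FLOPs/cycle} \]
    \[ \text{Zen (per core):} \quad 2 \times \tfrac{128}{64} \times 2 = 8 \ \text{FLOPs/cycle} \]
    \[ \text{Bulldozer (per module, shared by two cores):} \quad 2 \times \tfrac{128}{64} \times 2 = 8 \ \text{FLOPs/cycle} \]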

 

 

Academia:

 

Is there a version that can use the GPU?

 

This is still a no-go for current generation GPUs. But things may get more interesting with Xeon Phi.

  1. As of 2015, most GPUs are optimized for single-precision performance. Their double-precision and 64-bit integer throughput is far from impressive. (with notable exceptions being the Nvidia Tesla and Titan Black cards)

    The problem is that every single large integer multiplication algorithm uses either double-precision, 64-bit integer multiply, or carry-propagation. All of these operations are inefficient on current GPUs. And no, single-precision cannot be used because it imposes size limits that make the algorithms useless.

  2. GPUs require massive vectorization. Large number arithmetic is difficult to vectorize due to carry-propagation (see the sketch after this list). While the current speedups from CPU vectorization are significant, they were achieved with great difficulty using methods that are unlikely to scale to the level required by GPUs.

  3. Large computations of Pi and other constants are not limited by computing power. The bottleneck is in the data communication. (memory bandwidth, disk I/O, etc...) So throwing GPUs at the problem (even if they could be utilized) would not help much.
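
To illustrate point #2, here is a simplified big-integer addition (a sketch, not y-cruncher's code). Every limb needs the carry out of the previous limb, so the loop is one long serial dependency chain - exactly the kind of pattern that does not map naturally onto wide SIMD units or thousands of GPU threads.

    // Simplified sketch - not y-cruncher's code. Adds two equal-length
    // big integers stored as little-endian 64-bit limbs. Note how every
    // iteration depends on the carry produced by the previous one.
    #include <cstdint>
    #include <vector>

    std::vector<uint64_t> add(const std::vector<uint64_t>& a,
                              const std::vector<uint64_t>& b){
        std::vector<uint64_t> sum(a.size() + 1, 0);
        uint64_t carry = 0;
        for (size_t i = 0; i < a.size(); i++){
            uint64_t x = a[i] + carry;      //  may wrap around
            carry      = (x < carry);       //  carry out of the first add
            uint64_t y = x + b[i];
            carry     += (y < x);           //  carry out of the second add
            sum[i] = y;
        }
        sum[a.size()] = carry;              //  final carry-out limb
        return sum;
    }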

 

As for working around the communication bottleneck, the only possible option is to utilize GPU onboard memory as a cache for main memory, in a manner similar to how y-cruncher currently uses main memory as a cache for disk. But this is an additional level of design complexity that will not be easy to do.

 

Fundamental issues aside, the biggest practical barrier would be the need to rewrite the entire program using GPU programming paradigms. And for a project of this size that's merely a side hobby, it simply isn't feasible.

 

But before we slam the door on GPUs, it's worth mentioning the Xeon Phi processor line. While these aren't exactly GPUs, they are still massively parallelized and have large SIMD vectors. Preliminary benchmarks on Knights Landing are somewhat disappointing even with the AVX512 binaries. But it's difficult to draw any conclusions without access to the hardware and without the ability to tune the program for the hardware.

 

 


Why can't you use distributed computing to set records?

 

No, for more or less the same reasons that GPUs aren't useful.

  1. Just as with GPUs, computational power is not the bottleneck. It is the data communication. For this to be feasible as of 2015, everyone would need to have an internet connection speed of more than 1 GB/s. Anything slower than that and it's faster to do it on a single computer.

  2. Computing a lot of digits requires a lot of memory which would need to be distributed among all the participants. But there is no tolerance for data loss and distributed computing means that participants can freely join or leave the network at any time. Therefore, a tremendous amount of redundancy will be needed to ensure that no data is lost when participants leave.

 


Is there a distributed version that performs better on NUMA and HPC clusters?

 

No, but it is a current research topic.

 

For now, the best thing you can do is to interleave memory. In Linux, this can be done by running: numactl --interleave=all "./y-cruncher.out"

Some systems have BIOS options that do something similar.

 

 


Why have recent Pi records used desktops instead of supercomputers?

While the rest of the world is trending towards more parallelism, computations of Pi seem to have gone the other way.

 

The only recent Pi record which has "gone the other way" is Fabrice Bellard's computation of 2.7 trillion digits back in 2009. That was the major leap from supercomputer to... a single desktop computer. But since then, all the records have been done using increasingly larger (commodity) server hardware. Nevertheless, the hardware used in these computations is still pretty far removed from actual supercomputers.

 

So the real question is: Why aren't we using supercomputers anymore?

 

Unfortunately, I don't have a good answer for it. Sure, y-cruncher has been dominating the recent Pi records using single-node computer systems. But that doesn't explain why nobody from the supercomputing field has joined in. Some possible contributing factors are:

  1. The performance gap between CPU and memory has grown so large that perhaps supercomputers simply cannot be efficiently utilized. The recent Pi computations using single-node desktops and servers had disk bandwidth on the order of gigabytes per second. Supercomputer interconnects are generally slower than that.

  2. Supercomputers have more useful things to do. So it's probably more economically viable to use commodity hardware and run for several months than to tie down a multi-million dollar supercomputer for even a few days. Prior to 2010, there were no known desktop programs capable of efficiently computing Pi to that many digits. So it wasn't even possible to run on commodity hardware unless you wrote one yourself.

  3. Supercomputers are generally inaccessible to the public. On the other hand, everybody has a laptop or a desktop. So the pool of programmers is much larger for commodity hardware than supercomputers.

  4. It's easier to program for desktops than supercomputers. This makes it possible to implement more complex programs which could otherwise be prohibitively difficult for supercomputers. This is probably why the majority of the supercomputer Pi records were done with the AGM method instead of the series methods. The AGM is a lot slower, but it's also much simpler.

 

 

Programming:

 

What is the technical explanation for the notorious, "Coefficient is too large" error?

 

The "Coefficient is too large" error is one of many redundancy check which y-cruncher uses. To understand what this redundancy check is at the technical level, it helps to have a basic understanding of large multiplication via Fast Fourier Transform (FFT).

 

In this algorithm, each of the input operands is broken up into the coefficients of a polynomial. An FFT is then used to multiply the polynomials together. Based on the sizes of the input polynomials and their coefficients, there are provable bounds on the size of each coefficient of the output polynomial.

 

y-cruncher has a redundancy check that verifies that all the output coefficients are below this bound. If any coefficient exceeds it, then the program knows that a computational error has occurred and it throws the "Coefficient is too large" error.
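
As a toy illustration of the check (with a naive O(n^2) polynomial multiply standing in for the FFT, and none of y-cruncher's actual machinery):

    // Toy illustration of the redundancy check - not y-cruncher's code.
    // Each operand is split into coefficients less than BASE, the polynomials
    // are multiplied (naively here, instead of with an FFT), and every output
    // coefficient is checked against its provable bound. A hardware error in
    // the transform usually produces a garbage coefficient that violates the
    // bound, which is what triggers "Coefficient is too large".
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main(){
        const uint64_t BASE = 10000;                    //  coefficients in [0, BASE)
        std::vector<uint64_t> a = {9999, 1234, 5678};   //  operand A, little-endian base-10^4
        std::vector<uint64_t> b = {8765, 4321, 9999};   //  operand B

        std::vector<uint64_t> c(a.size() + b.size() - 1, 0);
        for (size_t i = 0; i < a.size(); i++)
            for (size_t j = 0; j < b.size(); j++)
                c[i + j] += a[i] * b[j];

        //  Provable bound: each output coefficient is a sum of at most
        //  min(len(a), len(b)) products, each less than (BASE - 1)^2.
        const uint64_t bound = std::min(a.size(), b.size()) * (BASE - 1) * (BASE - 1);
        for (uint64_t coef : c){
            if (coef > bound){
                std::printf("Redundancy Check Failed: Coefficient is too large.\n");
                return 1;
            }
        }
        std::printf("All coefficients within bound - carry and recombine as usual.\n");
        return 0;
    }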

 

The reason the "Coefficient is too large" error is so common comes down to coverage: y-cruncher uses this coefficient check because it has such high coverage with minimal computational overhead. But that same high coverage also makes the error almost useless for tracking down the source of the problem, since it's basically the same as saying "something went wrong".

 

 


Is there a publicly available library for the multi-threaded arithmetic that y-cruncher uses?

 

Yes. But it isn't as stable as a library should be. Support only exists for 64-bit Windows and backwards compatibility breaks on a regular basis.

 

 


Is y-cruncher open-sourced?

 

y-cruncher itself is closed source. But some of the related side-projects like the Digit Viewer and the Number Factory app are open-sourced.

 

 

 

Program Usage:

 

What's the difference between "Total Computation Time" and "Total Time"? Which is relevant for benchmarks?

 

"Total Computation Time" is the total time required to compute the constant. It does not include the time needed to write the digits to disk nor does it include the time needed to verify the base conversion. "Total Time" is the end-to-end time of the entire computation which includes everything.

 

The CPU utilization measurements cover the same span as the "Total Computation Time". They do not include the digit output or the verification of the base conversion.

 

For benchmarking, it's better to use the "Total Computation Time". A slow disk that takes a long time to write out the digits will affect neither the computation time nor the CPU utilization measurements. Most other Pi-programs measure time the same way, so y-cruncher does the same for better comparability. All the benchmark charts on this website, as well as any forum threads run by me, use the "Total Computation Time".

 

For world record size computations, we use the "Total Time" since everything is relevant - including down time. If the computation was done in several phases, the run-time that is put in the charts is the difference between the start and end dates.

 

There's currently no "standard" for extremely long-running computations that are neither benchmarks nor world record sized.

 

 


Why does y-cruncher need administrator privileges in Windows to run Swap Mode computations?

 

Privilege elevation is needed to work-around a security feature that would otherwise hurt performance.

 

In Swap Mode, y-cruncher creates large files and writes to them non-sequentially. When you create a new file and write to offset X, the OS will zero the file from the start to X. This zeroing is done for security reasons to prevent the program from reading data that has been leftover from files that have been deleted.

 

The problem is that this zeroing incurs a huge performance hit - especially when these swap files could be terabytes large. The only way to avoid this zeroing is to use the SetFileValidData() function which requires privilege elevation.
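
For the curious, the work-around looks roughly like this (a minimal sketch with error handling omitted - not y-cruncher's actual code). SetFileValidData() only succeeds if the process has the SeManageVolumePrivilege enabled, which is why elevation is required:

    // Minimal sketch of the work-around - not y-cruncher's actual code,
    // and with all error handling omitted.
    #include <windows.h>

    int main(){
        //  1. Enable SeManageVolumePrivilege on the process token (needs admin).
        HANDLE token;
        OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &token);
        TOKEN_PRIVILEGES tp = {};
        tp.PrivilegeCount = 1;
        tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
        LookupPrivilegeValue(NULL, SE_MANAGE_VOLUME_NAME, &tp.Privileges[0].Luid);
        AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL);

        //  2. Create the swap file and allocate space for it.
        HANDLE file = CreateFileA("swapfile.bin", GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        LARGE_INTEGER size;
        size.QuadPart = 1LL << 30;                   //  example: 1 GiB
        SetFilePointerEx(file, size, NULL, FILE_BEGIN);
        SetEndOfFile(file);

        //  3. Mark the whole range as valid so non-sequential writes
        //     don't trigger the zero-fill.
        SetFileValidData(file, size.QuadPart);

        CloseHandle(file);
        CloseHandle(token);
        return 0;
    }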

 

Linux doesn't have this problem since it implicitly uses sparse files.
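
A quick way to see the Linux behavior (again, just a sketch): write one byte deep into a fresh file and compare the file's logical size to how much space is actually allocated.

    // Sketch for 64-bit Linux - not y-cruncher's code.
    // Writing at a large offset leaves a "hole": the skipped range reads back
    // as zeros but is never physically written, so there is no zero-fill cost.
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main(){
        int fd = open("swapfile.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        pwrite(fd, "x", 1, 1LL << 30);      //  write 1 byte at offset 1 GiB

        struct stat st;
        fstat(fd, &st);
        //  st_size reports ~1 GiB, but st_blocks shows only a few KiB allocated.
        std::printf("logical size = %lld bytes, allocated = %lld bytes\n",
                    (long long)st.st_size, (long long)st.st_blocks * 512);
        close(fd);
        return 0;
    }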

 

 


Why is the performance so poor for small computations? The program only gets xx% CPU utilization on my xx core machine for small sizes!!!

 

For small computations, there isn't much that can be parallelized. In fact, spawning N threads for an N core machine may actually take longer than the computation itself! In these cases, the program will decide not to use all available cores. Therefore, parallelism is really only helpful when there is a lot of work to be done.

 

For those who prefer academic terminology, y-cruncher has weak scalability, but not strong scalability. For a fixed computation size, it is not possible to sustain a fixed efficiency while increasing the number of processors. But it is possible if you increase the computation size as well.
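
In the usual textbook notation (standard definitions, nothing y-cruncher-specific), with T(n, p) being the time to solve a problem of size n on p processors, the parallel efficiency is:

    \[ E(n, p) \;=\; \frac{T(n, 1)}{p \cdot T(n, p)} \]

Strong scalability asks that E(n, p) stay roughly constant as p grows with n held fixed; weak scalability only asks for this when n is allowed to grow along with p, which matches the behavior described above.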

 

 

 

Other:

 

What about support for other platforms? Mac, ARM, etc...

 

Short answer: Not right now. There's little to gain for a lot of effort.

 

While it would be nice to have y-cruncher running everywhere, the time and resource commitment is simply too high. So the best I can do is cover the most important platforms and call it a day.

 

y-cruncher currently supports 3 platforms: Windows/x86, Windows/x64, and Linux/x64. Either of the x64 platforms is sufficient for y-cruncher's purposes. Windows/x86 is there because the program started that way and it's easy to maintain alongside Windows/x64. On the other hand, Linux/x64 is a completely different platform with different compilers and system APIs. For this reason, it takes a significant amount of time and effort to keep y-cruncher working on Linux/x64.

 

My experience with just Linux has basically convinced me to stay away from any additional platforms for the time being.

 

 

What about Mac/x64?

I'm not going to spend my time figuring out Hackintosh installs nor am I going to buy certified Mac hardware. It's messy enough just to dual-boot Windows/Linux on all my boxes. Triple boot Windows/Linux/Mac? Um...

 

What about ARM?

ARM is not yet competitive with x64 on the high-end server market - let alone mainstream. So there's little to gain from it. Furthermore, I have no experience or expertise on ARM development. As far as having the hardware, I have a smartphone...

 

 

Even if y-cruncher were open-sourced with willing contributors, this wouldn't be easy. For one, there are no ARM/NEON-optimized code-paths. And even if someone were to do it, the code goes through way too many large and breaking changes.