Ram access as a bottleneck

Posted by Timo Kluck on Friday, 11 August 2017

Introduction

When my coworker @tvdw mentioned, a while ago, that memory throughput was now the main bottleneck of his number cruncher, I remember dismissing that as a relatively esoteric situation. In theory, I understood caching effects and that registers and CPU cache are much faster to access than main memory, but this had never had any practical implications for me.

I recently found myself in a situation where a priority queue operation was my main bottleneck. Timing the operations line by line, it turned out that array indexing was the slow part. This sent me down a rabbit hole that I shouldn't keep to myself - hence the blog post :)

Summing a big array

First, let me convince you that you, too, can see cache effects in real life. They're most apparent in a compiled language, so I'll use Julia, which compiles its functions to your native instruction set through LLVM.

julia> A = rand(Int, 2^25 - 1);

We now have an array of a few hundred megabytes in memory. My plan is to compute its sum, which is just an excuse to force the processor to access every item once. The trick is that we don't have to access them "from left to right".

Here's my function definition -- don't worry, I'll explain it step by step in a second:

julia> s(a, stride) = ( n = length(a); sum(a[(stride*i) % n + 1] for i=0:(n-1)) )

It takes an array a which we are going to sum (I'll pass A eventually) and an integer number called stride. To understand what it does, let's first look at the case stride = 1. (In case you are not familiar with Julia, for the following you should know that it has 1-based indexing like Matlab and Fortran.)

In that case, we get

( n = length(a); sum(a[i % n + 1] for i=0:(n-1)) )

and because 0 <= i < n, the part i % n is just i. So the whole thing reduces to

( n = length(a); sum(a[i + 1] for i=0:(n-1)) )

which just computes the sum by starting at 1 and ending at n.
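For a quick sanity check of that reduction, you can compare the stride = 1 case against Julia's built-in sum on a small array (not part of the benchmark, just a check):

```julia
# With stride = 1 the function sums front to back, so it must agree
# with the built-in sum().
s(a, stride) = ( n = length(a); sum(a[(stride*i) % n + 1] for i=0:(n-1)) )

small = collect(1:10)
@assert s(small, 1) == sum(small)    # both give 55
```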

What happens when stride is greater than one? Let's take stride = 4 as an example. In this case, we obtain

( n = length(a); sum(a[(4*i) % n + 1] for i=0:(n-1)) )

and so we are summing

a[1] + a[5] + a[9] + a[13] + a[17] + ...

After about 2^23 summands, we end up with:

... + a[2^25 - 15] + a[2^25 - 11] + a[2^25 - 7] + a[2^25 - 3]

and that's when something interesting happens. The next value of 4*i will be 2^25, and % n reduces that to 1, so the next index is 1 + 1 = 2. The sum continues with

+ a[2] + a[6] + a[10] + ...

and we fill up exactly the gaps that we left in the first iteration. A similar thing occurs twice more. The end result is that we sum the same numbers, but in a different order. Indeed, we get the same result:
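You can watch this wrap-around happen on a toy example: take n = 15 (odd, like 2^25 - 1) and stride = 4. The visited indices cover 1 through 15 exactly once, just in a different order:

```julia
# Visiting order for n = 15, stride = 4: runs 1, 5, 9, 13, then wraps
# around and fills the gaps starting at 2, then 3, then 4.
n = 15
order = [(4*i) % n + 1 for i = 0:(n-1)]
println(order)   # [1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, 4, 8, 12]
@assert sort(order) == collect(1:n)
```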

julia> s(A, 1) == s(A, 4)
true

In case you wonder why the gaps fill up as nicely as they do: the technical reason is that stride (in this case 4) and n (in this case 2^25 - 1) have a least common multiple equal to their product. Equivalently, their greatest common divisor is 1. And that's because 4 is a power of two while 2^25 - 1 is odd. Try it out:

julia> s(A, 1) == s(A, 2)
true
julia> s(A, 1) == s(A, 4)
true
julia> s(A, 1) == s(A, 8)
true

On the other hand, stride = 31 does not work, because 2^25 - 1 is divisible by 31:

julia> s(A, 1) == s(A, 31)
false
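The failing case is exactly what the gcd criterion predicts. Here is that criterion as a one-liner (`covers_all` is a hypothetical helper name I'm introducing just for illustration):

```julia
# A stride visits every index exactly once iff gcd(stride, n) == 1;
# otherwise some indices repeat and others are never visited.
covers_all(stride, n) = gcd(stride, n) == 1

n = 2^25 - 1
@assert covers_all(4, n)       # 4 is a power of two, n is odd
@assert !covers_all(31, n)     # 31 divides 2^25 - 1, so indices repeat
```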

Timing

Let's get back to caching. As we've seen, we have different ways to compute the same sum. However, as you can see below, because of caching, the times they take are very different:

julia> @time s(A, 1);
  0.276059 seconds (9 allocations: 272 bytes)
julia> @time s(A, 2);
  0.277251 seconds (9 allocations: 272 bytes)
julia> @time s(A, 8);
  0.288234 seconds (9 allocations: 272 bytes)
julia> @time s(A, 16);
  0.456280 seconds (9 allocations: 272 bytes)
julia> @time s(A, 64);
  0.694755 seconds (9 allocations: 272 bytes)
julia> @time s(A, 256);
  1.232814 seconds (9 allocations: 272 bytes)

The last one is almost 5x slower than the first! Why? There are two possible reasons I can think of:

  1. The bigger stride gives a bigger operand to the modulo operation, which may become slower as a result;
  2. Out-of-order memory access is slower.

We can discard the first easily by collecting timings for the same computation but skipping the memory access (summing the indices instead):

julia> t(a,stride) = (n = length(a); sum((stride*i)%n for i in 1:n));
julia> t(A,1) == t(A,256)
true
julia> @time t(A,1);
  0.285911 seconds (8 allocations: 288 bytes)
julia> @time t(A,256);
  0.272905 seconds (8 allocations: 288 bytes)

(As we can see, the n modulo operations dominate the running time for small values of stride. Taking that into account, the slowdown attributable to memory access alone is far greater than just 5x.)
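To make that concrete, you can subtract the index-only time from the full time, giving a rough estimate of the memory-access cost on its own (a back-of-the-envelope sketch using @elapsed; the exact numbers depend on your machine):

```julia
# Estimate memory-access time by subtracting the index-only timing
# from the full timing. Warm-up calls first so we don't measure
# Julia's compilation time.
s(a, stride) = ( n = length(a); sum(a[(stride*i) % n + 1] for i=0:(n-1)) )
t(a, stride) = ( n = length(a); sum((stride*i) % n for i in 1:n) )

A = rand(Int, 2^25 - 1)
s(A, 1); t(A, 1)                       # warm up (compile)

for stride in (1, 256)
    full = @elapsed s(A, stride)       # indexing + memory access
    idx  = @elapsed t(A, stride)       # indexing only
    println("stride = $stride: ≈ $(round(full - idx, digits=3)) s on memory access")
end
```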

This leaves the second reason. Whenever your CPU accesses data in memory, it doesn't just fetch the byte you requested; it fetches a whole block of bytes (a cache line) and keeps it around. Accessing the neighbouring bytes next costs hardly any time compared to the initial access.

What does that mean for our case? If stride=1, we access our array in memory order. That means we reap maximal profit from the CPU's caching strategy: whenever it has retrieved a cache line, we'll read the entire thing before moving to the next cache line.

When the stride is greater, the effectiveness of the caching strategy rapidly diminishes, because we skip many of the bytes that the CPU has just made available to us.
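The falloff is easy to reproduce by sweeping over a few strides and timing each run yourself (a quick sketch; all strides below are powers of two, hence coprime with n = 2^25 - 1, so every run computes the same sum):

```julia
# Time the same sum under increasing strides. Only the access order
# differs between runs; the widening gap is the cache effect.
s(a, stride) = ( n = length(a); sum(a[(stride*i) % n + 1] for i=0:(n-1)) )

A = rand(Int, 2^25 - 1)
reference = s(A, 1)                    # also warms up (compiles) s

for stride in (1, 2, 8, 16, 64, 256)
    elapsed = @elapsed result = s(A, stride)
    @assert result == reference        # same sum, different order
    println("stride = $stride: $(round(elapsed, digits=3)) s")
end
```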

Conclusion

In conclusion, it is very easy to observe low-level caching effects measurably in real life. There are a few related topics that we haven't touched on at all:

  • The compiler may 'vectorize' some operations. That is, sum(a) may be compiled to an even faster version than what we've seen above.
  • There are several cache levels, each subsequent level being slower but bigger. By analyzing the timing increases above, it should be possible to deduce the latencies and sizes of these cache levels.
  • I still want to show you how this relates back to priority queues :)

All of those topics are for another time. Thanks for reading!