When my coworker @tvdw mentioned, a while ago, that memory throughput was now the main bottleneck of his number cruncher, I remember understanding that as a relatively esoteric situation. Theoretically, I understand caching effects and that registers and CPU cache are much quicker to access than main memory, but this never had actual implications for me.
I recently got in a situation where a priority queue operation was my main bottleneck. Splitting out operations line-by-line, it turned out that array indexing was the slow part. This directed me down a rabbit hole that I shouldn't keep for myself - hence the blog post :)
Summing a big array
First, let me convince you that you, too, can see cache effects in real life. It's most apparent in a compiled language, so I'll use Julia which compiles its functions to your native instruction set through LLVM.
julia> A = rand(Int, 2^25 - 1);
We now have an array of a few hundred megabytes in memory. My plan is to compute its sum, which is just an excuse to force the processor to access every item once. The knack is that we don't have to access them "from left to right".
Here's my function definition -- don't worry, I'll explain it step by step in a second:
julia> s(a, stride) = ( n = length(a); sum(a[mod1(stride*i,n)] for i=1:n) )
It takes an array
a which we are going to sum (I'll pass
A eventually) and an integer number
stride. To understand what it does, let's first look at the case
stride = 1. (In case you are not familiar with Julia, for the following you
should know that it has 1-based indexing like Matlab and Fortran. That's why it
has a specific modulo operator
mod1 that takes values between
In that case, we get
( n = length(a); sum(a[mod1(i, n)] for i=1:n) )
1 <= i <= n, the part
mod1(i, n) is just
i. So the whole thing reduces to
( n = length(a); sum(a[i] for i=1:n) )
which just computes the sum of the elements of
What happens when
stride is greater than one? Let's take
stride = 4 as an example.
In this case, we obtain
( n = length(a); sum(a[mod1(4*i, n)] for i=1:n) )
and so we are summing
a + a + a + a + a + ...
After about 2^23 summands, we end up with:
... + a[2^25 - 15] + a[2^25 - 11] + a[2^25 - 7] + a[2^25 - 3]
and that's when something interesting happens. The next value of
4*i will be
2^25, and because of
mod1(..., n), that brings us back to
1. So this continues with
+ a + a + a + ...
and we fill up exactly the gaps that we left in the first iteration. A similar thing occurs twice more. The end result is that we sum the same numbers, but in a different order. Indeed, we get the same result:
julia> s(A, 1) == s(A, 4) true
In case you wonder why the gaps fill up as nicely as they do: the technical reason is that
stride(in this case
n(in this case
2^25 - 1) have a least common multiple that is equal to their product. Another way of saying that is that they have a greatest common divisor equal to
1. And that's because
4is a power of two, but
2^25 - 1is odd. Try it out:
julia> s(A, 1) == s(A, 2) true julia> s(A, 1) == s(A, 4) true julia> s(A, 1) == s(A, 8) true
On the other hand, since
2^25 - 1is divisible by it,
stride = 31does not work:
julia> s(A, 1) == s(A, 31) false
Let's get back to caching. As we've seen, we have different ways to compute the same sum. However, as you can see below, because of caching, the times they take are very different:
julia> @time s(A, 1); 0.276059 seconds (9 allocations: 272 bytes) julia> @time s(A, 2); 0.277251 seconds (9 allocations: 272 bytes) julia> @time s(A, 8); 0.288234 seconds (9 allocations: 272 bytes) julia> @time s(A,16); 0.456280 seconds (9 allocations: 272 bytes) julia> @time s(A,64); 0.694755 seconds (9 allocations: 272 bytes) julia> @time s(A,256); 1.232814 seconds (9 allocations: 272 bytes)
The last one is almost 5x slower than the first! Why? There's two possible reasons I can think of:
- The bigger stride gives a bigger operand to the modulo operation, which may become slower as a result;
- Out-of-order memory access is slower.
We can discard the first easily by collecting timings for the same computation but skipping the memory access (summing the indices instead):
julia> t(a,stride) = (n = length(a); sum(mod1(stride*i,n) for i in 1:n)); julia> t(A,1) == t(A,256) true julia> @time t(A,1); 0.285911 seconds (8 allocations: 288 bytes) julia> @time t(A,256); 0.272905 seconds (8 allocations: 288 bytes)
(As we can see, the
n modulo operations dominate the operation for small
stride. When we take that into account, the timing difference
for memory access alone becomes far greater than just 5x.)
This leaves the second reason. Whenever your CPU accesses data in memory, it doesn't just access the byte you requested; it requests a block of bytes (a cache line) and keeps them around. If you access those ones next, that will hardly cost any time compared to the initial access.
What does that mean for our case? If
stride=1, we access our array in memory
order. That means we reap maximal profit from the CPU's caching strategy:
whenever it has retrieved a cache line, we'll read the entire thing before
moving to the next cache line.
When the stride is greater, the effectiveness of the caching strategy rapidly diminishes, because we skip many of the bytes that the CPU has just made available to us.
In conclusion, it is very easy to see low-level caching effects measureably in real-life. There is a few related topics that we haven't touched at all:
- The compiler may 'vectorize' some operations. That is,
sum(a)may be compiled to an even faster version than what we've seen above.
- There is several cache layers, each subsequent layer being slower-but-bigger. By analyzing the timing increases above, it should be possible to understand the timings and sizes of these cache layers.
- I still want to show you how this relates back to priority queues :)
All of those topics are for another time. Thanks for reading!