This post may interest you if you like Julia, have a rough understanding of when you'd typically use its broadcasting feature, and want to learn how I re-used that feature for a significant performance optimization in a setting without axes and indices.
The math problem
A common maths challenge is to find out whether a certain polynomial can be written as a linear combination of some other polynomials. For example, is
y^2
a linear combination of these polynomials
x^5
x^2 + y
x*y + y^2
or not? In fact, it is, but it is not obvious how to find the specific combination:
y^2 = 1*(x^5) + (y + x*y - x^3)*(x^2 + y) + (-x)*(x*y + y^2)
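As a quick sanity check, we can evaluate both sides at a few points; a true polynomial identity must agree everywhere:

```julia
# Evaluate both sides of the identity above at a few sample points.
lhs(x, y) = y^2
rhs(x, y) = 1*(x^5) + (y + x*y - x^3)*(x^2 + y) + (-x)*(x*y + y^2)

for (x, y) in [(1, 2), (-3, 5), (7, -11)]
    @assert lhs(x, y) == rhs(x, y)
end
```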
An algorithmic answer to this question is given by the concept of Gröbner basis. (Warning: the Wikipedia description is a bit dense. Understanding this concept is not relevant for the rest of this post.) A key operation in this algorithm is the following reduction step:
f = m1*f - m2*t*g
for some integers m1, m2, a monomial t, and a polynomial g. These elements are chosen such that the leading term of f cancels.
This key operation is in a tight inner loop, so the faster we can make it, the better. I overloaded Julia's new broadcasting behaviour, and that allowed me to replace the reduction step by
@. f = m1*f - m2*t*g
reaping a 66% performance improvement. Keep reading to learn how!
Under the hood of polynomial operations
Polynomials such as f above are stored as vectors of terms. Each term is a struct containing a BigInt coefficient and a Tuple encoding the exponents of the monomial (code).
When taking the difference of two polynomials, we loop over both vectors of terms and, if any monomial appears in both of them, we subtract one coefficient from the other.
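This merge can be sketched as follows. The representation (exponent tuple => coefficient, sorted by exponents) and the function name are illustrative, not the actual PolynomialRings.jl code:

```julia
# Sketch of polynomial subtraction as a merge of two sorted term
# vectors: terms are (exponent tuple => coefficient) pairs, sorted
# by exponents.
const Term = Pair{NTuple{2,Int},BigInt}

function subtract(f::Vector{Term}, g::Vector{Term})
    result = Term[]
    i, j = 1, 1
    while i <= length(f) && j <= length(g)
        mf, cf = f[i]
        mg, cg = g[j]
        if mf == mg                      # monomial in both: subtract
            c = cf - cg
            iszero(c) || push!(result, mf => c)
            i += 1; j += 1
        elseif mf < mg                   # monomial only in f
            push!(result, mf => cf); i += 1
        else                             # monomial only in g
            push!(result, mg => -cg); j += 1
        end
    end
    append!(result, f[i:end])            # leftover terms of f
    for (m, c) in g[j:end]               # leftover terms of g, negated
        push!(result, m => -c)
    end
    return result
end
```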
Let's look at a slightly simpler version of the reduction step above:
f = f - g
Because BigInt
coefficients are heap allocated, what really happens
is that we allocate new BigInt
s to hold the result of the right-hand
side operation. Upon assigning to the left-hand side, we discard all
the perfectly fine BigInt
s we had in f
before.
It is therefore tempting to use Base.GMP.MPZ.sub!.
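For illustration, this function writes its result into an existing BigInt. Note that it is a Julia-internal, unexported API, so it may change between versions:

```julia
# Base.GMP.MPZ.sub!(x, a, b) stores a - b into the existing BigInt x
# without allocating a new one. Internal, unexported API.
a = big(10)
b = big(4)
Base.GMP.MPZ.sub!(a, a, b)
@assert a == 6    # the same BigInt object now holds the difference
```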
Indeed, one way to speed up this operation, at the expense of readability, is my first attempt here: a hand-rolled function that operates on f in-place:
f = inplace_reduce!(f, m1, m2, t, g)
(If you wonder why we both assign the result and use an in-place (!) operation: this is to support both mutable (BigInt) and immutable (Int) coefficients.)
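A minimal sketch of this pattern (the name sub!! is hypothetical, not from the actual code): mutate when the coefficient type allows it, otherwise return a new value, and always assign the result:

```julia
# Hypothetical sub!! helper: in-place for mutable BigInt, plain
# subtraction for immutable Int. Callers always write c = sub!!(c, d).
sub!!(a::BigInt, b::BigInt) = Base.GMP.MPZ.sub!(a, a, b)  # mutates a
sub!!(a::Int, b::Int) = a - b                              # fresh value

c = big(10); d = big(3)
c = sub!!(c, d)       # c is the same BigInt object, now holding 7
x = sub!!(10, 3)      # x is simply 7
```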
This is in fact the fastest option. Unfortunately, it is not the most readable version, and it also doesn't help anyone else who wants to similarly speed up a different operation.
Broadcasting
Julia's broadcasting support allows operating element-by-element on arrays. For example:
julia> a = [1,2,3];
julia> a .+= [3,2,1]; a
[4,4,4]
(It also allows broadcasting lower-dimensional arrays / scalars over the dimensions of higher-dimensional arrays, but that doesn't quite have an analogue for polynomials so I won't discuss that.) The idea is to re-purpose this for term-by-term operations on polynomials.
Since Julia 1.0, thanks to Matt Bauman's awesome work, there is a very nice interface for doing exactly this. It allowed me to do the following things:
Step 1. Decide how I want to broadcast expressions with a Polynomial in them.
I call my own broadcasting style Termwise:
broadcastable(p::Polynomial) = Ref(p)
struct Termwise <: BroadcastStyle end
BroadcastStyle(::Type{<:Polynomial}) = Termwise()
(code).
Step 2. Decide how to broadcast if several different broadcasting styles exist in the same expression. In my case, I'd like to fall back to normal behaviour if any elements exist that I don't control. For example, if there are any arrays in the expression, then I want the default broadcasting behaviour. This allows something like
f .* [f, f]
to work element-wise on the array. The code to accomplish this looks like this:
BroadcastStyle(s::Termwise, t::Termwise) = Termwise()
BroadcastStyle(s::Termwise, t::AbstractArrayStyle{0}) = s
BroadcastStyle(s::Termwise, t::BroadcastStyle) = t
(code). This says that the combination of two Termwise broadcasts is again a Termwise, as is the combination of a Termwise and a 0-dimensional array, i.e. a scalar. Any other broadcasting style 'wins' over term-wise broadcasting.
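These resolution rules can be exercised in isolation with a stand-alone Termwise style; Base.Broadcast.result_style is the function that combines the styles of all arguments. The definitions below mirror the rules above but are not the actual PolynomialRings.jl code:

```julia
import Base.Broadcast: BroadcastStyle
using Base.Broadcast: AbstractArrayStyle, DefaultArrayStyle, result_style

struct Termwise <: BroadcastStyle end

BroadcastStyle(s::Termwise, t::Termwise) = Termwise()
BroadcastStyle(s::Termwise, t::AbstractArrayStyle{0}) = s
BroadcastStyle(s::Termwise, t::BroadcastStyle) = t

# Two Termwise styles, or Termwise plus a scalar, stay Termwise;
# any real array style wins over Termwise.
@assert result_style(Termwise(), Termwise()) isa Termwise
@assert result_style(Termwise(), DefaultArrayStyle{0}()) isa Termwise
@assert result_style(Termwise(), DefaultArrayStyle{1}()) isa DefaultArrayStyle{1}
```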
Step 3. Iterate over the terms, and know which ones I can modify.
I defined a type called TermsMap (code) that just iterates over terms. It has one extra feature: a type parameter InPlace decides whether the terms it yields can be modified. So in the case of the broadcasted version of f = f - g, we would get a TermsMap{true} from f and a TermsMap{false} from g.
The minus operation - is implemented as a TermsMap of its own whose operands are the two TermsMaps from f and g. This allows us to use the in-place version sub! in cases where one of the operands is a TermsMap{true}, such as here.
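The shape of such a lazy term stream can be sketched as follows (a simplified stand-in for the real TermsMap; names and details are illustrative):

```julia
# A TermsMap-like lazy iterator: InPlace is a type parameter recording
# whether the yielded terms may be mutated by the consumer.
struct TermsMap{InPlace, F, T}
    op::F        # transformation applied to each term
    terms::T     # underlying term storage
end
TermsMap{InPlace}(op::F, terms::T) where {InPlace, F, T} =
    TermsMap{InPlace, F, T}(op, terms)

isinplace(::TermsMap{InPlace}) where {InPlace} = InPlace

Base.IteratorSize(::Type{<:TermsMap}) = Base.SizeUnknown()
function Base.iterate(tm::TermsMap, state...)
    it = iterate(tm.terms, state...)
    it === nothing && return nothing
    return (tm.op(it[1]), it[2])
end

# The destination f may donate its terms for reuse; g may not:
from_f = TermsMap{true}(identity, [1, 2, 3])
from_g = TermsMap{false}(identity, [4, 5, 6])
```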
Step 4. Implement materialize! for term-wise broadcasts.
Now all that's left is to iterate over the top-level TermsMap and write the result to f. This happens here.
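That write-back step can be sketched like this: stream the resulting terms into the destination's existing storage, reusing slots where possible (a simplified stand-in, not the actual materialize! implementation):

```julia
# Write a stream of terms into dest's existing storage, reusing the
# slots that are already there and trimming any excess at the end.
function write_terms!(dest::Vector, termstream)
    n = 0
    for t in termstream
        n += 1
        if n <= length(dest)
            dest[n] = t        # reuse an existing slot
        else
            push!(dest, t)     # grow only when needed
        end
    end
    resize!(dest, n)           # drop leftover old terms
    return dest
end

f_terms = [10, 20, 30]
write_terms!(f_terms, (2x for x in 1:2))
```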
Hand-optimizing the special case
This made my computations some ~50% faster. It did turn out that, because of the TermsMap abstraction, some time was lost in needless allocations. Some experimentation with @code_warntype pointed at type inference giving Any for the state of this iteration, so I suspect that every iteration needs an allocation.
I tried a few other ways of expressing that iteration, but none allowed me to reduce the allocations. In the end, I decided to hand-optimize the particular inner-loop broadcast. This gets very close to the performance of inplace_reduce!, but there's still a small difference.
Profiling using StatProfilerHTML indicates that the difference is explained by the following: in the generic broadcasting function, we need to verify at run-time whether the target f and the operands f and g are distinct or not. In the inplace_reduce! case, we know statically that the target is the same object as the operand f and distinct from g.
I decided to accept that cost in the interest of readability.
Recursive broadcasting?
The above works really well when f is a Polynomial, but not when it is a Vector{<:Polynomial}. In those cases, an operation like
f .+= g
will re-use the Vector storage for f, but not the Polynomials or their BigInt coefficients. I'm convinced this can be fixed in the same way as above.
An interesting analogous case is Vector{BigInt}.
Suppose that we do
julia> a = BigInt[1, 2, 3];
julia> a .+= 3
The current implementation would keep the container a but allocate new BigInt objects. It would be very tempting to operate in-place in these cases! This could be achieved in a similar way to what I did above.
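We can observe the current behaviour directly: the Vector is reused, but the elements are fresh BigInt objects:

```julia
# After a .+= 3, the container is the same Vector, but each element
# is a newly allocated BigInt rather than a mutated one.
a = BigInt[1, 2, 3]
container = a
first_element = a[1]
a .+= 3
@assert a === container             # the Vector itself is reused
@assert !(a[1] === first_element)   # but the BigInt was replaced
@assert a == BigInt[4, 5, 6]
```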
These two cases together make me wonder whether we couldn't achieve the same thing in an easier and more general way by making broadcasting recursive. For example, we'd define broadcasting semantics as follows:
a .+= 3
would do
a[i] .+= 3
for every index i. We additionally define broadcasting on BigInt variables to compile to in-place operations.
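Under those proposed semantics, the loop body could bottom out in GMP's in-place addition, along these lines (a sketch using the internal Base.GMP.MPZ API; the helper name is hypothetical):

```julia
# Sketch of in-place elementwise addition for Vector{BigInt}: each
# element is mutated via GMP instead of being replaced by a new BigInt.
function add_scalar!(a::Vector{BigInt}, n::Integer)
    nb = BigInt(n)
    for x in a
        Base.GMP.MPZ.add!(x, x, nb)   # x += nb, in place
    end
    return a
end

a2 = BigInt[1, 2, 3]
first_element2 = a2[1]
add_scalar!(a2, 3)
@assert a2 == BigInt[4, 5, 6]
@assert a2[1] === first_element2   # same BigInt object, mutated
```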
These semantics would give similar improvements to what I achieved above, now to any code that uses BigInts or arrays of them.
Conclusion
Thanks to Julia's broadcasting, I could use in-place operations in a very readable way. This allowed me to get a significant performance boost without losing any readability.