This post may interest you if you like Julia, have a rough understanding of when you'd typically use its broadcasting feature, and want to learn how I re-used that feature for a significant performance optimization in a setting without axes and indices.
The math problem
A common maths challenge is to find out whether a certain polynomial can be written as a linear combination of some other polynomials. For example, is
y^2
a linear combination of these polynomials
x^5
x^2 + y
x*y + y^2
or not? In fact, it is, but it is not obvious how to find the specific combination:
y^2 = 1*(x^5) + (y + x*y - x^3)*(x^2 + y) + (-x)*(x*y + y^2)
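As a quick sanity check, we can evaluate both sides at a few points; a true polynomial identity must agree everywhere:

```julia
# Evaluate both sides of the identity above at a few sample points.
lhs(x, y) = y^2
rhs(x, y) = 1*(x^5) + (y + x*y - x^3)*(x^2 + y) + (-x)*(x*y + y^2)

for (x, y) in [(1, 2), (-3, 5), (7, -11)]
    @assert lhs(x, y) == rhs(x, y)
end
```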
An algorithmic answer to this question is given by the concept of Gröbner basis. (Warning: the Wikipedia description is a bit dense. Understanding this concept is not relevant for the rest of this post.) A key operation in this algorithm is the following reduction step:
f = m1*f - m2*t*g
for some integers m1, m2, a monomial t, and a polynomial g. These elements are chosen such that the leading term of f cancels.
This key operation is in a tight inner loop, so the faster we can make it, the better. I overloaded Julia's new broadcasting behaviour, and that allowed me to replace the reduction step by
@. f = m1*f - m2*t*g
reaping a 66% performance improvement. Keep reading to learn how!
Under the hood of polynomial operations
Polynomials such as f above are stored as vectors of terms. Each term is a struct containing a BigInt coefficient and a Tuple encoding the exponents of the monomial (code).
When taking the difference of two polynomials, we loop over both vectors of terms and, if any monomial appears in both of them, we subtract one coefficient from the other.
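This merge can be sketched as follows. The representation (exponent tuple => coefficient, sorted by exponents) and the function name are illustrative, not the actual PolynomialRings.jl code:

```julia
# Sketch of polynomial subtraction as a merge of two sorted term
# vectors: terms are (exponent tuple => coefficient) pairs, sorted
# by exponents.
const Term = Pair{NTuple{2,Int},BigInt}

function subtract(f::Vector{Term}, g::Vector{Term})
    result = Term[]
    i, j = 1, 1
    while i <= length(f) && j <= length(g)
        mf, cf = f[i]
        mg, cg = g[j]
        if mf == mg                      # monomial in both: subtract
            c = cf - cg
            iszero(c) || push!(result, mf => c)
            i += 1; j += 1
        elseif mf < mg                   # monomial only in f
            push!(result, mf => cf); i += 1
        else                             # monomial only in g
            push!(result, mg => -cg); j += 1
        end
    end
    append!(result, f[i:end])            # leftover terms of f
    for (m, c) in g[j:end]               # leftover terms of g, negated
        push!(result, m => -c)
    end
    return result
end
```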
Let's look at a slightly simpler version of the reduction step above:
f = f - g
Because BigInt
coefficients are heap allocated, what really happens
is that we allocate new BigInt
s to hold the result of the right-hand
side operation. Upon assigning to the left-hand side, we discard all
the perfectly fine BigInt
s we had in f
before.
It is therefore tempting to use Base.GMP.MPZ.sub!.
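For illustration, this function writes its result into an existing BigInt. Note that it is a Julia-internal, unexported API, so it may change between versions:

```julia
# Base.GMP.MPZ.sub!(x, a, b) stores a - b into the existing BigInt x
# without allocating a new one. Internal, unexported API.
a = big(10)
b = big(4)
Base.GMP.MPZ.sub!(a, a, b)
@assert a == 6    # the same BigInt object now holds the difference
```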
Indeed, one way to speed up this operation, at the expense of readability, is my first attempt here: a hand-rolled function that operates on f in-place:
f = inplace_reduce!(f, m1, m2, t, g)
(If you wonder why we both assign the result and use an in-place (!) operation: this is to support both mutable (BigInt) and immutable (Int) coefficients.)
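A minimal sketch of this pattern (the name sub!! is hypothetical, not from the actual code): mutate when the coefficient type allows it, otherwise return a new value, and always assign the result:

```julia
# Hypothetical sub!! helper: in-place for mutable BigInt, plain
# subtraction for immutable Int. Callers always write c = sub!!(c, d).
sub!!(a::BigInt, b::BigInt) = Base.GMP.MPZ.sub!(a, a, b)  # mutates a
sub!!(a::Int, b::Int) = a - b                              # fresh value

c = big(10); d = big(3)
c = sub!!(c, d)       # c is the same BigInt object, now holding 7
x = sub!!(10, 3)      # x is simply 7
```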
This is in fact the fastest option. Unfortunately, it is not the most readable version, and it also doesn't help anyone else who wants to similarly speed up a different operation.
Broadcasting
Julia's broadcasting support allows operating element-by-element on arrays. For example:
julia> a = [1,2,3];
julia> a .+= [3,2,1]; a
[4,4,4]
(It also allows broadcasting lower-dimensional arrays / scalars over the dimensions of higher-dimensional arrays, but that doesn't quite have an analogue for polynomials so I won't discuss that.) The idea is to re-purpose this for term-by-term operations on polynomials.
Since Julia 1.0, thanks to Matt Bauman's awesome work, there is a very nice interface for doing exactly this. It allowed me to do the following things:
Step 1. Decide how I want to broadcast expressions with a Polynomial in them.
I call my own broadcasting style Termwise:
broadcastable(p::Polynomial) = Ref(p)
struct Termwise <: BroadcastStyle end
BroadcastStyle(::Type{<:Polynomial}) = Termwise()
(code).
Step 2. Decide how to broadcast if several different broadcasting styles exist in the same expression. In my case, I'd like to fall back to normal behaviour if any elements exist that I don't control. For example, if there are any arrays in the expression, then I want the default broadcasting behaviour. This allows something like
f .* [f, f]
to work element-wise on the array. The code to accomplish this looks like this:
BroadcastStyle(s::Termwise, t::Termwise) = Termwise()
BroadcastStyle(s::Termwise, t::AbstractArrayStyle{0}) = s
BroadcastStyle(s::Termwise, t::BroadcastStyle) = t
(code). This says that the combination of two Termwise broadcasts is again a Termwise, as is the combination of a Termwise and a 0-dimensional array, i.e. a scalar. Any other broadcasting style 'wins' over term-wise broadcasting.
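These resolution rules can be exercised in isolation with a stand-alone Termwise style; Base.Broadcast.result_style is the function that combines the styles of all arguments. The definitions below mirror the rules above but are not the actual PolynomialRings.jl code:

```julia
import Base.Broadcast: BroadcastStyle
using Base.Broadcast: AbstractArrayStyle, DefaultArrayStyle, result_style

struct Termwise <: BroadcastStyle end

BroadcastStyle(s::Termwise, t::Termwise) = Termwise()
BroadcastStyle(s::Termwise, t::AbstractArrayStyle{0}) = s
BroadcastStyle(s::Termwise, t::BroadcastStyle) = t

# Two Termwise styles, or Termwise plus a scalar, stay Termwise;
# any real array style wins over Termwise.
@assert result_style(Termwise(), Termwise()) isa Termwise
@assert result_style(Termwise(), DefaultArrayStyle{0}()) isa Termwise
@assert result_style(Termwise(), DefaultArrayStyle{1}()) isa DefaultArrayStyle{1}
```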
Step 3. Iterate over the terms, and know which ones I can modify.
I defined a type called TermsMap (code) that just iterates over terms. It has one extra feature: a type parameter InPlace decides whether the terms it yields can be modified. So in the case of the broadcasted version of f = f - g, we would get a TermsMap{true} from f and a TermsMap{false} from g.
The minus operation - is implemented as a TermsMap of its own whose operands are the two TermsMaps from f and g. This allows us to use the in-place version sub! in cases where one of the operands is a TermsMap{true}, such as here.
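The shape of such a lazy term stream can be sketched as follows (a simplified stand-in for the real TermsMap; names and details are illustrative):

```julia
# A TermsMap-like lazy iterator: InPlace is a type parameter recording
# whether the yielded terms may be mutated by the consumer.
struct TermsMap{InPlace, F, T}
    op::F        # transformation applied to each term
    terms::T     # underlying term storage
end
TermsMap{InPlace}(op::F, terms::T) where {InPlace, F, T} =
    TermsMap{InPlace, F, T}(op, terms)

isinplace(::TermsMap{InPlace}) where {InPlace} = InPlace

Base.IteratorSize(::Type{<:TermsMap}) = Base.SizeUnknown()
function Base.iterate(tm::TermsMap, state...)
    it = iterate(tm.terms, state...)
    it === nothing && return nothing
    return (tm.op(it[1]), it[2])
end

# The destination f may donate its terms for reuse; g may not:
from_f = TermsMap{true}(identity, [1, 2, 3])
from_g = TermsMap{false}(identity, [4, 5, 6])
```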
Step 4. Implement materialize! for term-wise broadcasts.
Now all that's left is to iterate over the top-level TermsMap and write the result to f. This happens here.
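That write-back step can be sketched like this: stream the resulting terms into the destination's existing storage, reusing slots where possible (a simplified stand-in, not the actual materialize! implementation):

```julia
# Write a stream of terms into dest's existing storage, reusing the
# slots that are already there and trimming any excess at the end.
function write_terms!(dest::Vector, termstream)
    n = 0
    for t in termstream
        n += 1
        if n <= length(dest)
            dest[n] = t        # reuse an existing slot
        else
            push!(dest, t)     # grow only when needed
        end
    end
    resize!(dest, n)           # drop leftover old terms
    return dest
end

f_terms = [10, 20, 30]
write_terms!(f_terms, (2x for x in 1:2))
```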
Hand-optimizing the special case
This made my computations some ~50% faster. It did turn out that, because of the TermsMap abstraction, some time was lost in needless allocations. Some experimentation with @code_warntype pointed at type inference giving Any for the state of this iteration, so I suspect that every iteration needs an allocation.
I tried a few other ways of expressing that iteration, but none allowed me to reduce the allocations. In the end, I decided to hand-optimize the particular inner-loop broadcast. This gets very close to the performance of inplace_reduce!, but there's still a small difference.
Profiling using StatProfilerHTML indicates that the difference is explained by the following: in the generic broadcasting function, we need to verify at run-time whether the target f and the operands f and g are distinct or not. In the inplace_reduce! case, we know statically that the target is the same object as the operand f and distinct from g.
I decided to accept that cost in the interest of readability.
Recursive broadcasting?
The above works really well when f is a Polynomial, but not when it is a Vector{<:Polynomial}. In those cases, an operation like
f .+= g
will re-use the Vector storage for f, but not the Polynomials or their BigInt coefficients. I'm convinced this can be fixed in the same way as above.
An interesting analogous case is Vector{BigInt}.
Suppose that we do
julia> a = BigInt[1, 2, 3];
julia> a .+= 3
The current implementation would keep the container a but allocate new BigInt objects. It would be very tempting to operate in-place in these cases! This could be achieved in a similar way to what I did above.
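We can observe the current behaviour directly: the Vector is reused, but the elements are fresh BigInt objects:

```julia
# After a .+= 3, the container is the same Vector, but each element
# is a newly allocated BigInt rather than a mutated one.
a = BigInt[1, 2, 3]
container = a
first_element = a[1]
a .+= 3
@assert a === container             # the Vector itself is reused
@assert !(a[1] === first_element)   # but the BigInt was replaced
@assert a == BigInt[4, 5, 6]
```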
These two cases together make me wonder whether we couldn't achieve the same thing in an easier and more general way by making broadcasting recursive. For example, we'd define broadcasting semantics as follows:
a .+= 3
would do
a[i] .+= 3
for every index i. We additionally define broadcasting on BigInt variables to compile to in-place operations.
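Under those proposed semantics, the loop body could bottom out in GMP's in-place addition, along these lines (a sketch using the internal Base.GMP.MPZ API; the helper name is hypothetical):

```julia
# Sketch of in-place elementwise addition for Vector{BigInt}: each
# element is mutated via GMP instead of being replaced by a new BigInt.
function add_scalar!(a::Vector{BigInt}, n::Integer)
    nb = BigInt(n)
    for x in a
        Base.GMP.MPZ.add!(x, x, nb)   # x += nb, in place
    end
    return a
end

a2 = BigInt[1, 2, 3]
first_element2 = a2[1]
add_scalar!(a2, 3)
@assert a2 == BigInt[4, 5, 6]
@assert a2[1] === first_element2   # same BigInt object, mutated
```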
These semantics would give similar improvements to what I achieved above, now to any code that uses BigInts or arrays of them.
Conclusion
Thanks to Julia's broadcasting, I could use in-place operations in a very readable way. This allowed me to get a significant performance boost without losing any readability.