This post may be interesting for you if you like Julia, if you have a rough understanding of when you'd typically use its broadcasting feature, and if you want to learn about how I re-used that feature to get a significant performance optimization in a setting without axes and indices.

### The math problem

A common maths challenge is to find out whether a certain polynomial can be written as a linear combination of some other polynomials. For example, is

```
y^2
```

a linear combination of these polynomials

```
x^5
x^2 + y
x*y + y^2
```

or not? In fact, it is, but it is not obvious how to find the specific combination:

```
y^2 = 1*(x^5) + (y + x*y - x^3)*(x^2 + y) + (-x)*(x*y + y^2)
```
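If you want to convince yourself of this identity without expanding it by hand, you can evaluate both sides at integer points. Since both sides are polynomials of degree at most 5 in `x` and 2 in `y`, exact agreement on a small integer grid already proves equality:

```julia
# Sanity-check of the combination above by evaluating both sides
# on a grid of integer points (exact integer arithmetic).
lhs(x, y) = y^2
rhs(x, y) = 1*(x^5) + (y + x*y - x^3)*(x^2 + y) + (-x)*(x*y + y^2)

all(lhs(x, y) == rhs(x, y) for x in -3:3, y in -3:3)  # true
```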

An algorithmic answer to this question is given by the concept of
*Gröbner basis*. (Warning: the Wikipedia description
is a bit dense. Understanding this concept is not relevant for the
rest of this post.) A key operation in this algorithm is the
following reduction step:

```
f = m1*f - m2*t*g
```

for some integers `m1, m2`, for a monomial `t`, and for some polynomial `g`. These elements are chosen such that the leading term of `f` cancels.
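Here is a small worked instance of that cancellation, with numbers of my own choosing (the polynomials are represented as plain functions purely to check the arithmetic; these are not the package's types):

```julia
f(x, y) = 2x^2 + y   # leading term: 2x^2
g(x, y) = 3x + 1     # leading term: 3x
m1, m2 = 3, 2        # chosen so that m1*(2x^2) == m2*x*(3x)
t(x, y) = x          # the monomial that lines the leading terms up

reduced(x, y) = m1*f(x, y) - m2*t(x, y)*g(x, y)

# The leading term 6x^2 cancels, leaving 3y - 2x:
all(reduced(x, y) == 3y - 2x for x in -2:2, y in -2:2)  # true
```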

This key operation is in a tight inner loop, so the faster we can make it, the better. I overloaded Julia's new broadcasting behaviour, and that allowed me to replace the reduction step by

```
@. f = m1*f - m2*t*g
```

reaping a 66% performance improvement. Keep reading to learn how!
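For readers less familiar with `@.`: it rewrites every call and assignment in the expression into its dotted (broadcasting) form, so the line above is equivalent to `f .= m1 .* f .- m2 .* t .* g`. On ordinary arrays, that looks like this (a standalone illustration, unrelated to the polynomial types):

```julia
f = big.([1, 2, 3])
g = big.([1, 1, 1])
m1, m2, t = 2, 3, 1

@. f = m1*f - m2*t*g   # same as f .= m1 .* f .- m2 .* t .* g

f == big.([-1, 1, 3])  # true: the Vector f was updated in place
```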

### Under the hood of polynomial operations

Polynomials such as `f` above are stored as vectors of terms. Each term is a `struct` containing a `BigInt` coefficient and a `Tuple` encoding the exponents of the monomial (code).

When taking the difference of two polynomials, we loop over both vectors of terms and, if any monomial appears in both of them, we subtract one coefficient from the other.
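A minimal sketch of this representation and the merge-style subtraction; the names `Term` and `sub` are illustrative, not the package's actual definitions:

```julia
struct Term{N}
    coeff::BigInt
    exponents::NTuple{N,Int}   # e.g. (2, 1) encodes x^2*y
end

# Subtract term vectors that are kept sorted by monomial, largest first.
function sub(f::Vector{Term{N}}, g::Vector{Term{N}}) where N
    result = Term{N}[]
    i, j = 1, 1
    while i <= length(f) && j <= length(g)
        if f[i].exponents == g[j].exponents
            c = f[i].coeff - g[j].coeff
            iszero(c) || push!(result, Term(c, f[i].exponents))
            i += 1; j += 1
        elseif f[i].exponents > g[j].exponents  # tuples compare lexicographically
            push!(result, f[i]); i += 1
        else
            push!(result, Term(-g[j].coeff, g[j].exponents)); j += 1
        end
    end
    append!(result, f[i:end])
    for k in j:length(g)
        push!(result, Term(-g[k].coeff, g[k].exponents))
    end
    return result
end
```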

Let's look at a slightly simpler version of the reduction step above:

```
f = f - g
```

Because `BigInt` coefficients are heap-allocated, what really happens is that we allocate new `BigInt`s to hold the result of the right-hand side operation. Upon assigning to the left-hand side, we discard all the perfectly fine `BigInt`s we had in `f` before.
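We can observe this reallocation directly with plain `BigInt`s:

```julia
a = big(5)
b = big(2)
old = a      # keep a handle on the BigInt object currently held by a

a = a - b    # allocates a fresh BigInt for the result

a === old    # false: a now points at a new object; the old one is garbage
a == 3       # true
```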

It is therefore tempting to use `Base.GMP.MPZ.sub!`. Indeed, one way to speed up this operation, at the expense of readability, is my first attempt here: a custom-rolled function that operates on `f` in-place:

```
f = inplace_reduce!(f, m1, m2, t, g)
```

(If you wonder why we both assign the result and use an in-place (`!`) operation: this is to support both mutable (`BigInt`) and immutable (`Int`) coefficients.)

This is in fact the fastest option. Unfortunately, it is not the most readable version, and it also doesn't help anyone else who wants to similarly speed up a different operation.
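For reference, here is what `Base.GMP.MPZ.sub!` does. Note that this is an internal, undocumented binding of GMP's `mpz_sub`, so it may change between Julia versions:

```julia
x = big(10)
y = big(3)
handle = x

Base.GMP.MPZ.sub!(x, x, y)   # x = x - y, writing into x's existing storage

x == 7        # true
x === handle  # true: no new BigInt was allocated
```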

### Broadcasting

Julia's broadcasting support allows operating element-by-element on arrays. For example:

```
julia> a = [1,2,3];

julia> a .+= [3,2,1]; a
3-element Array{Int64,1}:
 4
 4
 4
```

(It also allows broadcasting lower-dimensional arrays / scalars over the dimensions of higher-dimensional arrays, but that doesn't quite have an analogue for polynomials so I won't discuss that.) The idea is to re-purpose this for term-by-term operations on polynomials.

Since Julia 1.0, thanks to Matt Bauman's awesome work, there is a very nice interface for doing exactly this. It allowed me to do the following things:

**Step 1. Decide how I want to broadcast expressions with a Polynomial in them.**
I call my own broadcasting style `Termwise`:

```
struct Termwise <: BroadcastStyle end
broadcastable(p::Polynomial) = Ref(p)
BroadcastStyle(::Type{<:Polynomial}) = Termwise()
```

(code).

**Step 2. Decide how to broadcast if several different broadcasting styles exist in the
same expression.** In my case, I'd like to fall back to normal behaviour if any elements
exist that I don't control. For example, if there's any arrays in the expression, then I
want the default broadcasting behaviour. This allows something like

```
f .* [f, f]
```

to work element-wise on the array. The code to accomplish this looks like this:

```
BroadcastStyle(s::Termwise, t::Termwise) = Termwise()
BroadcastStyle(s::Termwise, t::AbstractArrayStyle{0}) = s
BroadcastStyle(s::Termwise, t::BroadcastStyle) = t
```

(code). This says that the combination of two `Termwise` broadcasts is again a `Termwise`, and so is the combination of a `Termwise` and a 0-dimensional array, i.e. a scalar. For any other broadcasting style, the other style 'wins' over term-wise broadcasting.
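The effect of these rules can be checked with `Broadcast.result_style`, the (internal, so possibly version-dependent) helper that resolves the style for a pair of operands. Here is a self-contained rendition using a stand-in `Termwise`; the real one lives in the package:

```julia
struct Termwise <: Broadcast.BroadcastStyle end
Broadcast.BroadcastStyle(s::Termwise, t::Termwise) = Termwise()
Broadcast.BroadcastStyle(s::Termwise, t::Broadcast.AbstractArrayStyle{0}) = s
Broadcast.BroadcastStyle(s::Termwise, t::Broadcast.BroadcastStyle) = t

# A 1-dimensional array style wins over Termwise...
Broadcast.result_style(Termwise(), Broadcast.DefaultArrayStyle{1}())  # DefaultArrayStyle{1}()

# ...but Termwise absorbs 0-dimensional (scalar) operands:
Broadcast.result_style(Termwise(), Broadcast.DefaultArrayStyle{0}())  # Termwise()
```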

**Step 3. Iterate over the terms, and know which ones I can modify.**
I defined a type called `TermsMap` (code) that just iterates over terms. It has one extra feature: a type parameter `InPlace` decides whether the terms it yields can be modified. So in the case of the broadcasted version of `f = f - g`, we would get a `TermsMap{true}` from `f` and a `TermsMap{false}` from `g`.

The minus operation `-` is implemented as a `TermsMap` of its own whose operands are the two `TermsMap`s from `f` and `g`. This allows us to use the in-place version `sub!` in cases where one of the operands is a `TermsMap{true}`, such as here.
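A minimal rendition of this idea; illustrative only, since the package's real `TermsMap` (and the names `is_inplace`, `sub_coeff!` here) differ from this sketch:

```julia
# An iterator over terms whose type parameter records whether the
# terms it yields may be mutated in place.
struct TermsMap{InPlace, I}
    iter::I
end
TermsMap{InPlace}(iter::I) where {InPlace, I} = TermsMap{InPlace, I}(iter)

Base.iterate(tm::TermsMap, state...) = iterate(tm.iter, state...)
Base.IteratorSize(::Type{<:TermsMap}) = Base.SizeUnknown()

is_inplace(::TermsMap{InPlace}) where InPlace = InPlace

# A subtraction kernel can then pick the mutating GMP call only
# when it owns the left-hand coefficient:
sub_coeff!(tm::TermsMap, a::BigInt, b::BigInt) =
    is_inplace(tm) ? Base.GMP.MPZ.sub!(a, a, b) : a - b
```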

**Step 4. Implement `materialize!` for term-wise broadcasts.**
Now all that's left is to iterate over the top-level `TermsMap` and write the result to `f`. This happens here.

### Hand-optimizing the special case

This made my computations some ~50% faster. It did turn out that, because of the `TermsMap` abstraction, some time was lost in needless allocations. Some experimentation with `@code_warntype` pointed at type inference giving `Any` for the state of this iteration, so I suspect that every iteration needs an allocation.

I tried a few other ways of expressing that iteration, but none allowed me to reduce the allocations. In the end, I decided to hand-optimize the particular inner-loop broadcast. This gets very close to the performance of `inplace_reduce!`, but there's still a small difference.

Profiling using StatProfilerHTML indicates that the difference is explained by the following: in the generic broadcasting function, we need to verify at run-time whether the target `f` and the operands `f` and `g` are distinct or not. In the `inplace_reduce!` case, we know statically that the target is the same object as the operand `f` and distinct from `g`.

I decided to accept that cost in the interest of readability.

### Recursive broadcasting?

The above works really well when `f` is a `Polynomial`, but not when it is a `Vector{<:Polynomial}`. In those cases, an operation like

```
f .+= g
```

will re-use the `Vector` storage for `f`, but not the `Polynomial`s or their `BigInt` coefficients. I'm convinced this can be fixed in the same way as above.

An instructive analogy is the case of `Vector{BigInt}`. Suppose that we do

```
julia> a = BigInt[1, 2, 3];
julia> a .+= 3
```

The current implementation would keep the container `a` but allocate new `BigInt` objects. It would be very tempting to operate in-place in these cases! This could be achieved in a similar way to what I did above.
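This behaviour is easy to observe (a standalone illustration):

```julia
a = BigInt[1, 2, 3]
first_entry = a[1]     # handle on the BigInt currently in slot 1

a .+= 3                # re-uses the Vector, but each a[i] + 3 allocates

a == BigInt[4, 5, 6]   # true
a[1] === first_entry   # false: slot 1 now holds a freshly allocated BigInt
```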

These two cases together make me wonder whether we couldn't achieve the same thing in an easier and more general way by making broadcasting recursive. For example, we'd define broadcasting semantics as follows:

```
a .+= 3
```

would do

```
a[i] .+= 3
```

for every index `i`. We additionally define broadcasting on `BigInt` variables to compile to the in-place operations.

These semantics would give similar improvements to what I achieved above, now to any code that uses `BigInt`s or arrays of them.

### Conclusion

Thanks to Julia's broadcasting, I could use in-place operations in a very readable way. This allowed me to get a significant performance boost without losing any readability.