Conquer the Divide - Faster Integer Division

virtupic · December 22, 2015, 2:07pm

Conquer the Divide - Faster Integer Division

Introduction

How to implement C-style integer division on a microcontroller that doesn't provide appropriate instructions? In school we learn how to divide long nunmbers [1]. Wikipedia tells us that we can do better in certain cases [2]. However, these algorithm are only better for numbers longer than a certain limit. For C operations they don't give any advantage. This article presents an algorithm based on piecewise polynomial approximation to implement this operation.

State of the Art

Wikipedia presents a good overview of current division algorithms. [2] There you can find two different approaches. One is iterative shift and subtract as used in like binary schoolbook division and SRT division. The other one is approximating the reciprocal of the divisor or of the quotient as in Newton Raphson approximation, Goldschmidt algorithm, and the binomial method.

These iterative methods use shift and add / subtract operations resulting in O(n) shifts and O(n) adds / subtracts for an n × n → n division while the approximation methods have quadratic convergence resulting in O(log(n)) multiplications and additions of size n.

Looks as if the approximations are much faster than the iterations. However, asymptotic complexity tells nothing about small problem sizes. We have to look at the absolute numbers. Each iteration of such an approximation needs at least two 16 × 16 bit multiplications which eat many cycles destroying the advantage. Okay, there are is at least one method converging even faster [3] but this one needs more operations per iteration.

Basically, division algorithms work by estimate and improve a quotient or the reciprocal of the divisor. You guess an approximation, calculate the error and improve the approximation until the error is small enough that you consider your result reached. For schoolbook method you guess the highest order bit, calculate the remainder, and if is too large then you guess the next bit. For Newton-Raphson you make an initial guess, calculate the error, and if the error is too large then you use this result as the new guess for the next improvement by linear estimation. The key for the schoolbook method and SRT division is that improvements are extremely simple. The key for Newton-Raphson, Goldschmidt, and binomial is that they converge fast enough that improvements can be more complex.

The new idea

There seens to be some kind of equilibrium. (Side note: This phenomenon can be observed in many areas of algorithms.) Slowly converging algorithms have simple elementary operations and fast converging algorithms needs complex elementary operations. To get some faster solution we have to find simpler elementary operations or faster converging algorithms. Right?

Wromg! The total runtime is the product of both factors. There is a chance that we get lower runtime if we use suboptimal solutions for both factors. There are other approximation frameworks. What about polynomial approximation of the reciprocal? There are lots of methods: Newton, minimum square residual, minimax, nearly minimax... Let's have a look.

Polynomial approximation

Polynomial approximation is fast. It needs one multiplication by a constant and one addition per degree. First we scale our operands such that we get the divisor in the range [.5, 1) by shifting them to left until MSB of the divisor is set. This reduces the size of the approximation interval. If we interpolate over the whole integral then we will need a polynomial of degree so high that it cannot compete with the other methods.

Piecewise Polynomial Approximation

However, what if we divide the value range in several parts and use different polynomials for different intervals? The point here is that 1/ is a very smooth function with rapdly declining derivates meaning that we only need low degree polynomials for small operand ranges. There are proofs by Taylor and Chebyshev for the error limits but we don't need to go so far since we can easily see that this idea works. Look at the following algorithm:

Shift left operands until divisor has MSB set.
Table lookup coefficients for reciprocal approximation of divisor according to MSBs of divisor.
Evaluate polynomial.
Multiply dividend with reciprocal of divisor.
If necessary: Do final error corrections by schoolbook method or even iterative subtraction.

Let's keep it short. Calculation of Newton interpolation on equally distant nodes over equally sized intervals in Octave gives the following results for 16 bit resolution:

Linear interpolation gives large errors.
Quadratic interpolation over 8 intervals gives small errors to be corrected in post processing.
Quadratic interpolation over 16 intervals gives errors smaller than resolution

Division by piecewise polynomial approximation is a generic algorithms that has to be parameterized with approximation method, degree, and interval size. Actually, intervals may even have different sizes but this leads to additional execution time. All these parameters can be optimized according to the underlying hardware.

Conclusion

This article presents a new division algorithm based on piecewise polynomial interpolation giving better runtime on microcontrollers that offer fast multiplication than division algorithms known so far. The motivation is to find a good solution by combining good parts instead of combining extremely good parts with bad ones.

Future Work

There is still work to do. A few things are:

Finding (near) minimax polynomial interpolations.
Reference implementation in C or other high level language.
Evaluating on relevant microcontrollers like 8051 and ARM.

References

[1]	Long division, Wikipedia
[2]	Division algorithm, Wikipedia

scharette · December 29, 2015, 2:00pm

AVR?

Interseting stuff. I’d be curious to see an implementation of this algorithm on 8-bit AVRs to see how it would perform.

virtupic · December 30, 2015, 9:08am

Reference implementation - c, AVR, 8051, ARM

Yes, I will do a reference implementation soon. First in C to show the concept and then for a few microcontrollers I can find in my workbench. Basically I want to start with Laser Bee (8051), LPC and STM32 (ARM Cortex M0(+) / M3.

You’ve caught me! Actually, I haven’t envisioned AVR since I hade none at home. Okay, there should be a few ATtiny chips somewhere but there is no hope to find them. And there should be significant speed gain on AVR. If not, then it is your fault that I’ve just spend € 25,99 for an arduino micro board. Yes, I’ve installed the IDE and am able to write to the flash.

I will need a few days because I am working on other stuff currently. Stay tuned. And thank you for your interest!

virtupic · January 13, 2016, 3:07am

Please standby, I’m currently too busy with other stuff

Well, the subject tells it. Day job, side job, preparing a lightning talk on another topic for FOSDEM…

The first version of AVR code is written now. Now it’s time for debugging. I expect it to need some 140 cycles compared to about 200 cycles of the gcc code. And I have to select the right license terms to prevent liability for errors that might still be in the published version. I’ll continue in February or earlier.

Anyway, here we have an untested conceptual version in pseudocode for inspiration - no guarantee that it works correctly, do not use for coding:

// test input:
x = 30000;
y = 200;

shift = 0;
y = y;
while (y < 32768)
	y = y + y;
	shift = shift + 1;
endwhile

piece = floor(y / 2048) - 16;
ymin = 32768 + 2048 * piece;

yp = y - ymin;

A = [
  [65535, 65420, 60158],
  [61681, 57961, 50401],
  [58255, 51708, 42645],
  [55189, 46415, 36403],
  [52429, 41895, 31322],
  [49933, 38003, 27144],
  [47663, 34630, 23677],
  [45591, 31687, 20777],
  [43691, 29103, 18331],
  [41944, 26823, 16255],
  [40330, 24801, 14481],
  [38837, 22999, 12956],
  [37450, 21386, 11638],
  [36158, 19938, 10492],
  [34953, 18631,  9493],
  [33826, 17449,  8616],
];

a0 = A[piece][0];
a1 = A[piece][1];
a2 = A[piece][2];
r0 = a0;
r1 = floor(a1 * yp / 32768);
r2 = floor(a2 * yp * yp / 65536 / 16384);
r  = r0 - r1 + r2;

r = a0 + floor((-a1 + ((floor((a2 * yp) / 65536)) * yp)) / 16384);

q = round(x * r / 65536 / power(2, 15 - shift));

scharette · January 22, 2020, 6:38pm