Conquer the Divide - Faster Integer Division
How to implement C-style integer division on a microcontroller that doesn't provide appropriate instructions? In school we learn how to divide long nunmbers . Wikipedia tells us that we can do better in certain cases . However, these algorithm are only better for numbers longer than a certain limit. For C operations they don't give any advantage. This article presents an algorithm based on piecewise polynomial approximation to implement this operation.
State of the Art
Wikipedia presents a good overview of current division algorithms.  There you can find two different approaches. One is iterative shift and subtract as used in like binary schoolbook division and SRT division. The other one is approximating the reciprocal of the divisor or of the quotient as in Newton Raphson approximation, Goldschmidt algorithm, and the binomial method.
These iterative methods use shift and add / subtract operations resulting in O(n) shifts and O(n) adds / subtracts for an n × n → n division while the approximation methods have quadratic convergence resulting in O(log(n)) multiplications and additions of size n.
Looks as if the approximations are much faster than the iterations. However, asymptotic complexity tells nothing about small problem sizes. We have to look at the absolute numbers. Each iteration of such an approximation needs at least two 16 × 16 bit multiplications which eat many cycles destroying the advantage. Okay, there are is at least one method converging even faster  but this one needs more operations per iteration.
Basically, division algorithms work by estimate and improve a quotient or the reciprocal of the divisor. You guess an approximation, calculate the error and improve the approximation until the error is small enough that you consider your result reached. For schoolbook method you guess the highest order bit, calculate the remainder, and if is too large then you guess the next bit. For Newton-Raphson you make an initial guess, calculate the error, and if the error is too large then you use this result as the new guess for the next improvement by linear estimation. The key for the schoolbook method and SRT division is that improvements are extremely simple. The key for Newton-Raphson, Goldschmidt, and binomial is that they converge fast enough that improvements can be more complex.
The new idea
There seens to be some kind of equilibrium. (Side note: This phenomenon can be observed in many areas of algorithms.) Slowly converging algorithms have simple elementary operations and fast converging algorithms needs complex elementary operations. To get some faster solution we have to find simpler elementary operations or faster converging algorithms. Right?
Wromg! The total runtime is the product of both factors. There is a chance that we get lower runtime if we use suboptimal solutions for both factors. There are other approximation frameworks. What about polynomial approximation of the reciprocal? There are lots of methods: Newton, minimum square residual, minimax, nearly minimax... Let's have a look.
Polynomial approximation is fast. It needs one multiplication by a constant and one addition per degree. First we scale our operands such that we get the divisor in the range [.5, 1) by shifting them to left until MSB of the divisor is set. This reduces the size of the approximation interval. If we interpolate over the whole integral then we will need a polynomial of degree so high that it cannot compete with the other methods.
Piecewise Polynomial Approximation
However, what if we divide the value range in several parts and use different polynomials for different intervals? The point here is that 1/ is a very smooth function with rapdly declining derivates meaning that we only need low degree polynomials for small operand ranges. There are proofs by Taylor and Chebyshev for the error limits but we don't need to go so far since we can easily see that this idea works. Look at the following algorithm:
- Shift left operands until divisor has MSB set.
- Table lookup coefficients for reciprocal approximation of divisor according to MSBs of divisor.
- Evaluate polynomial.
- Multiply dividend with reciprocal of divisor.
- If necessary: Do final error corrections by schoolbook method or even iterative subtraction.
Let's keep it short. Calculation of Newton interpolation on equally distant nodes over equally sized intervals in Octave gives the following results for 16 bit resolution:
- Linear interpolation gives large errors.
- Quadratic interpolation over 8 intervals gives small errors to be corrected in post processing.
- Quadratic interpolation over 16 intervals gives errors smaller than resolution
Division by piecewise polynomial approximation is a generic algorithms that has to be parameterized with approximation method, degree, and interval size. Actually, intervals may even have different sizes but this leads to additional execution time. All these parameters can be optimized according to the underlying hardware.
This article presents a new division algorithm based on piecewise polynomial interpolation giving better runtime on microcontrollers that offer fast multiplication than division algorithms known so far. The motivation is to find a good solution by combining good parts instead of combining extremely good parts with bad ones.
There is still work to do. A few things are:
- Finding (near) minimax polynomial interpolations.
- Reference implementation in C or other high level language.
- Evaluating on relevant microcontrollers like 8051 and ARM.
|||Long division, Wikipedia|
|||Division algorithm, Wikipedia|