The challenge
On piclist.com we find lots of basic arithmetic routines including multiplication of two 4 bit numbers giving an 8 bit result [1]. It gives detailed derivation of his code which is small and fast, indeed.
But we can do better! Since this article was published Microchip has brought the enhanced baseline PIC to market which has competitive price, larger instruction set, and more special function registers with increased functionality. This gives the option of simple and fast table lookup in program memory.
The Solution
Well, it's rather straightforward: 4 x 4 -> 8 multiplication only needs a table of 256 byte entries. For fast lookup we align the table on a 0x100 boundary - in other words: on a 256 byte page. Access is done via FSR0L / FSR0H / INDF0. FSR0h doesn't change and can be reused if not needed otherwise. You can find the resulting code at the end of this post.
Space Performance
We need 256 instructions aligned to 0x100 for the table, two instuctions to initialize FSR0H, 4 instructions for the lookup, and one more instruction if we run the code as a subroutine instead of inilined.
The table can be reused for all such operations on a single chip and the initialization code, too, in many cases. Since we are talking about _fast_ operations we can assume that the code will be inlined. This gives a space requirement of 4 * n + 258 instructions for n inlined operations. The code of [1] needs 12 * n instructions for n inlined operations. This means that we get smaller as soon as we implement 24 such multiplications!
Time complexity
Table looup needs 5 cycles for 4 instructions. (Indirect read needs 2 cycles on enhanced baseline PIC.) This is sufficient if FSR0H has been initialized before. Including this init the code needs 7 cycles for 6 instructions and if we call it as a subroutine we have to add a return giving a total of 9 cycles for 7 instructions. To be honest, in this case we also need the call resulting in 11 cycles for 8 instructions.
Results
This post demonstrates an extremely fast 4 x 4 -> 8 bit multiplication. This is implemented by table lookup on enhanced baseline PIC. Code size is small for a single operation with large constant overhead.
This code doesn't run on baseline PIC that don't offer indirect access to program memory like PIC12F508. However, the new enhanced baseline PIC controllers are available at competitve price and deliver more functionality and lower poser consumption.
4 x 4 multiplication is not a real complex operation and doesn't seem very powerful at first sight. However, if small PIC controllers are used to recognize certaion patterns by correlation or in signal filtering by convolution for data transmission or loop control then we might use such a low precision operation for a first estimate and only invest into complex calculation if we detect significance.
Future Work
This list reflects a few ideas what _can_ but not necessarily _will_ be explored in this context.
- A little slower version using addwf PCL / retlw result table lookup on baseline PIC.
- Modify operand lengths like 3 x 3 -> 6, 3 x 5 -> 8.
- 4 x 4 -> 4 bit division.
- Other operations like 4 x 4 -> 4 mod 15 multiplication or 4 -> 8 squaring.
- Karatsuba's algorithm [2] based on this.
References
- [1] http://www.picist.com/techref/microchip/math/mul/4x4.htm
- [2] http://en.wikipedia.org/wiki/Karatsuba_algorithm
Code
;x input ;y input ;W resulttab equ 0x100
tabh equ tab >> 8
fsrh equ tabh | 0x80m4x4to8init
movlw fsrh
movwf FSR0H
m4x4to8
swapf x,W
iorwf y,W
movwf FSR0L
movf INDF0,W
#ifndef M4X4TO8_INLINE
return
#endiforg tab dw .0, .0, .0, .0, .0, .0, .0, .0, .0, .0, .0, .0, .0, .0, .0, .0 dw .0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .10, .11, .12, .13, .14, .15 dw .0, .2, .4, .6, .8,.10,.12, .14, .16, .18, .20, .22, .24, .26, .28, .30 dw .0, .3, .6, .9,.12,.15,.18, .21, .24, .27, .30, .33, .36, .39, .42, .45 dw .0, .4, .8,.12,.16,.20,.24, .28, .32, .36, .40, .44, .48, .52, .56, .60 dw .0, .5,.10,.15,.20,.25,.30, .35, .40, .45, .50, .55, .60, .65, .70, .75 dw .0, .6,.12,.18,.24,.30,.36, .42, .48, .54, .60, .66, .72, .78, .84, .90 dw .0, .7,.14,.21,.28,.35,.42, .49, .56, .63, .70, .77, .84, .91, .98,.105 dw .0, .8,.16,.24,.32,.40,.48, .56, .64, .72, .80, .88, .96,.104,.112,.120 dw .0, .9,.18,.27,.36,.45,.54, .63, .72, .81, .90, .99, 108,.117,.126,.135 dw .0,.10,.20,.30,.40,.50,.60, .70, .80, .90,.100,.110,.120,.130,.140,.150 dw .0,.11,.22,.33,.44,.55,.66, .77, .88, .99,.110,.121,.132,.143,.154,.165 dw .0,.12,.24,.36,.48,.60,.72, .84, .96,.108,.120,.132,.144,.156,.168,.180 dw .0,.13,.26,.39,.52,.65,.78, .91,.104,.117,.130,.143,.156,.169,.182,.195 dw .0,.14,.28,.42,.56,.70,.84, .98,.112,.126,.140,.154,.168,.182,.196,.210 dw .0,.15,.30,.45,.60,.75,.90,.105,.120,.135,.150,.165,.180,.195,.210,.225