QSM - Quarter Square Multiplication, 8 x 8 -> 16 Speed Record on Extended Baseline PIC!

QSM - Quarter Square Multiplication, Speed Record on Extended Baseline PIC!

I'm a great fan of PIC10F / PIC12F / PIC16F controllers. They are inexpensive, they offer verstalie peripherals, and they are absoulely energy efficient which is becoming extremely important in the era of IoT. However, they don't provide mulitplication hardware. There are several code snippets for software multiplication around, but there is still the challenge of doing it better. I've taken it! And here we go with the fastest 8 x 8 -> 16 multiplication I know on extended baseline PIC. You've got somethin better? Faster? Step forward!

Related Work

piclist.com offers several multiplication routines with different runtimes. [1] All of them are based on the classical shift-add-algorithm. One routine by an unknown author [2] gets it done in 36 instruction cycles (inlinded). Other code by Scott Datalo checks for leading zeroes in one operand [3]. According to the article this one has a runtime of 20 to 38 cycles but I'm getting different results. Average execution time depends on operand distribution. For equal distribution you are marginally faster than the other algorithm. And you have to fix it because the published code gives wrong results. Anyway, in worst case this routine is slower than the other one.

Sidenote: From time to time I'm thinking of implementing 8 x 8 -> 8 / 16 mult by Karatsuba's algoritm [4] to two 4 x 4 and one 5 x 5 multiplications but couldn't find an implementaion of competitive speed.

Another Idea

piclist.com has a link to a page telling about reducing multiplication to squaring [5]. There you can find an implementation on 8086 (rememeber it?) which gives faster execution than microprogrammed instruction. Wikipedia also shows this technique on its page on multiplication algorithms [6]. In brief: you square sum and difference of the two factors, then subtract these two squares and according to the well known binomial formulas you'll get the fourfold product:

((a + b)^2 - (a - b)^2) / 4
= (a^2 + 2ab + b^2 -(a^2 - 2ab + b^2)) / 4
= (2ab + 2ab) / 4 = a * b

Cute, is't it? This way we need two squares of 9 bit numbers and a few more operations around. What about implementing the squaring as table lookups? 512 entries of 18 bits... Hmm... 18 bit entries don't look that efficient. But we can handle reduce size! (a+b) and (a-b) are either both even or both odd. This means that both squares are either divisible by 4 or both have the form 4k+1. We subtract both squares so we can simply drop the odd parts, perform the divisions before building the tables, and only need 512 entries of 16 bits. Oh, the Wikipedia Article tells that there already was an implementation on the 6502 processor [7]. Remember the 6502? 8 bits, 2 MHz, 2 cycles per instruction, and 64 KByte address space. Built into 8 bit Atari and Commodore devices.

Anyway. This algorithm uses a table of a little less than 1024 bytes holding the square numbers of [0..510] divided by four and perform multiplication by two lookups and a little code around. Extended baseline PIC processors offer such lookups with FSRL/FSRH and INDFL/INDFH registers. PIC10F32x and PIC12F1501 have too small memory, but e.g. PIC12F1822 has 2KWords memory which is enough for multiplication and some code around.

Performance

Below you can see an implementation using 28 instructions besides the tables running in 32 instruction cycles. More cycles than instructions because an indirect load needs two cycles on enhanced baseline PIC.

Space performance doesn't look good at first sight. But if we need really fast multiplication we will inline and unrolled code anyway. Then we need 1024 + 28 * n instructions for n multiplications while unrolled tableless implementations need 35 or more instructions. This means that the QSM program will even get smaller if we use 147 or more inlined multiplications.

Speed is brilliant! The code needs 32 cycles when inlined which is significantl faster than every other 8 x 8 multiplication I found on the net. And since the processors in question don't have caches this speed is for real.

Regarding history this algorithm is not even new. Wikipedia tells us in the article on multiplication algorithms [6] that this algorithm has been known to Babylonic mathematicians already 3,600 years ago, that square tables have been published in 19th century, that Everett L. Johnson proposed in 1980 to use it in computer mathematics, and that there was an implementation on the 6502 in 1995. Why did we wait until 2015 to implement it on the PIC? I can't tell.

Future Work

This list reflects a few ideas what _can_ but not necessarily _will_ be explored in this context.

  • Even faster 8 x 8 -> 8 multiplication with smaller tables. First idea gives 20 cycles. Can we do better?
  • Table based fast 4 x 4 -> 8 multiplication. (Oops, already posted [8].)
  • Is a faster implementation possible?
  • BCD multiplication
  • Finding a fast implementation of Karatsuba's algorithm [4] for 8 x 8 -> 16 multimplication.
  • Implementing fast 10 x n / 12 x n multiplication for digital filter of ADC output.
  • Implementing fast 16 x 16 -> 32 multiplication.
  • Combining several such QSMs by Karatsuba's algorithm [4].
  • Montgomery reduction [9].
  • Trying other long number multiplications like Toom-Cook [10], Schönhage-Strassen [11], chinese remaindering [12]...

The Code

Here we go... [Drum roll!] ;-) The following listing shows the active code. The table is a little bit and a C program generating them will be presented below.

In this example code the squaring tables start at addres 0x300. You can move them but have to align them at a PCH page. For unaligned tables we would need additional code which would need additional runtime.

;Quarter Squaring Multiplication (QSM)
tabh    equ 3	;table starts at addres 0x300
fsrh    equ tabh | 0x80

;input: x, y
;output: zl, zh

mbs
;(x+y)^2
movlw fsrh
movwf FSR0H
movf x,W
addwf y,W
movwf FSR0L
btfsc STATUS,C
incf FSR0H,f
movf INDF0,W
movwf zl
incf FSR0H,f
incf FSR0H,f
movf INDF0,W
movwf zh
;(a-b)^2
movlw fsrh
movwf FSR0H
movf x,W
subwf y,W
btfss STATUS,C
sublw 0
movwf FSR0L
movf INDF0,W
subwf zl,f
btfss STATUS,C
decf zh,f
incf FSR0H,f
incf FSR0H,f
movf INDF0,W
subwf zh,f
return

tabh equ 3
fsrh equ tabh | 0x80

org tabh << 8
;See next paragraph.
END

The Table

This is the C program to generate the table. Copy the following C program, save it, compile it, run it and copy the produced text behind the ‘org tabh << 8’ line.

#include

int main(int argc, char *argv[]) {
int i;
printf("\t;LSB [0…511]^2/4\n");
for (i = 0; i < 512; i++) {
printf("%s0x%02x%s", i % 8 == 0 ? “\tdw\t” : “”, ((i * i) >> 2) & 0xff, i % 8 == 7 ? “\n” : “, “);
}
printf(”\t;MSB [0…511]^2/4\n”);
for (i = 0; i < 512; i++) {
printf("%s0x%02x%s", i % 8 == 0 ? “\tdw\t” : “”, (((i * i) >> 2) >> 8) & 0xff, i % 8 == 7 ? “\n” : ", ");
}
}

References