ASM02 - How Fast Dividing, with Rounding, a 32-Bit dividend by an 8-Bit Divisor Could be?

KerimF · Apr 7, 2023

As far as you know, is it possible to divide (by an AVR code) a 32-bit dividend by an 8-bit register, with rounding, in less than 140 cycles?
You can assume that there is not less than 1200 bytes free for this fast code (in ATmega8, for example).

Since I expect no one around here knows (or will know) how to do it (likely due to lack of interest), I will post the code next week.
Hint: You may like reading:
ASM01 – Fast AVR Code for Dividing 16-Bit Register by 8-Bit Register (0 to 255) with Rounding | Forum for Electronics (edaboard.com)

Have fun,
Kerim

KerimF · Apr 10, 2023

I also had to work on this division code because I did it in the past for another range than 0 to 255 (the possible values of an 8-bit register, as a divisor).

The best speed I was able to achieve is 142 cycles (instead of 140), and its size is less than 1300 bytes (instead of 1200). I failed in my own exam

Now, I let you gain 2 cycles and 100 bytes. Be happy.

Have fun,
Kerim

D.A.(Tony)Stewart · Apr 10, 2023

ldi r16, low(0x12345678) ; Load low byte of dividend
ldi r17, high(0x12345678) ; Load high byte of dividend
ldi r18, hh(0x12345678) ; Load high byte of high word of dividend
ldi r19, hhh(0x12345678) ; Load high byte of highest word of dividend
ldi r20, 0x12 ; Load divisor into register r20
clr r21 ; Clear remainder register
clr r22 ; Clear carry register

divs r16, r20 ; Divide low word of dividend by divisor
mov r23, r0 ; Move quotient into register r23
mov r24, r1 ; Move remainder into register r24
eor r25, r25 ; Clear carry flag

divs r17, r20 ; Divide next word of dividend by divisor
add r23, r0 ; Add quotient to register r23
adc r24, r1 ; Add remainder to register r24
adc r25, r25 ; Add carry flag to register r25

divs r18, r20 ; Divide next word of dividend by divisor
add r23, r0 ; Add quotient to register r23
adc r24, r1 ; Add remainder to register r24
adc r25, r25 ; Add carry flag to register r25

divs r19, r20 ; Divide highest word of dividend by divisor
add r23, r0 ; Add quotient to register r23
adc r24, r1 ; Add remainder to register r24
adc r25, r25 ; Add carry flag to register r25

; At this point, r23:r24 contains the 32-bit quotient, and r25 contains any remainder.

Credits: openAI

--- Updated Apr 10, 2023 ---

The div cmd uses 4 cycles

ldi r16, low(0xABCD) ; Load the low byte of the dividend
ldi r17, high(0xABCD) ; Load the high byte of the dividend
ldi r18, hh(0xABCD) ; Load the high byte of the high word of the dividend
ldi r19, 0x12 ; Load the divisor
div r19 ; Divide the 32-bit dividend by the 8-bit divisor

--- Updated Apr 10, 2023 ---

ldi r16, 0 ; clear the high 8 bits of the 16-bit divisor
ldi r17, 8-bit-divisor ; load the 8-bit divisor into the low 8 bits of the 16-bit divisor

movw r20, r22 ; copy the 32-bit dividend into the 16-bit registers r22:r23 and r24:r25
movw r22, r24

clr r26 ; clear the high 8 bits of the quotient
clr r27

div r20, r16 ; perform the hardware division operation, dividing the 32-bit dividend in r22:r23 and r24:r25 by the 8-bit divisor in r17

; the quotient is now in r26:r27, and the remainder is in r22:r23

KerimF · Apr 11, 2023

D.A.(Tony)Stewart said:
divs r16, r20 ; Divide low word of dividend by divisor

Thank you.

So DIVS is an instruction of the MCU family of... ?
I just have ATmega8

FvM · Apr 11, 2023

You are right, AVR has no divide instruction. DIVS looks like Motorola 68k processor.

KerimF · Apr 11, 2023

FvM said:
You are right, AVR has no divide instruction. DIVS looks like Motorola 68k processor.

It seems to me that Motorola 68k processor a CPU which needs an external memory for its instructions.
I am familiar with Z80 (in my early years of programming and by which I produced my first outdoor moving message signs using 25W/220V bulbs driven by triacs), then with AT89 family (or C51?), before ending up with AVR MCUs (when I was able downloading AVR studio. But its last version, I had the privilege to download, is 6.2).

For instance, I started writing code in machine language on paper, yes. bit by bit. since I couldn't get an assembler for Z80 (this was around year 1979). Manually, I had to enter these bits (using switches and push buttons) in order to save them in a set of SRAM ICs before saving them on an EPROM (which could be erased by ultraviolet). It took me about 3 months to debug my first code (to let it work) which can be written and debugged now in a few hours at most. Naturally, after I got an assembler for Z80 then for MCUs, I wrote all my programs in their assembly language. But I had to learn C in order to write PC programs (though I have a C compiler for DOS only) when necessary.

D.A.(Tony)Stewart · Apr 11, 2023

AVR32 code

KerimF · Apr 11, 2023

D.A.(Tony)Stewart said:
AVR32 code

I see. And I wonder for how long I will need to live before I will have the privilege (which almost all engineers around the world have) to work with AVR32 family and the like

Fortunately, AVR8 gave me the opportunity, even by using its old ATmega8 (and ATmega32 when real necessary), to produce products which can usually compete with imported similar ones, besides the ones for special applications.

Anyway, you did well (I say this based on trust since I personally cannot check your work

). And it is not your fault that I wasn't very clear that my question is for AVR8 (I just said 'AVR code' which certainly includes AVR32).

D.A.(Tony)Stewart · Apr 11, 2023

That's OK, as I didn't check my openAI answer either, which I supplied since I suspected no one else would.

KerimF · Apr 11, 2023

D.A.(Tony)Stewart said:
div r20, r16 ; perform the hardware division operation, dividing the 32-bit dividend in r22:r23 and r24:r25 by the 8-bit divisor in r17

; the quotient is now in r26:r27, and the remainder is in r22:r23

I guess now, I have to check the code you got, even not closely, since it is not yours

For example, could you please help me understand the last line (comment)?

If the divisor, r17, happens to be 2 and the 32-bit dividend, r22:r25, is 2^32-1, the quotient will be 2147483648 (after rounding) which also needs four 8-bit registers, not just two, as r26:r27 which are mentioned on the comment line above.

D.A.(Tony)Stewart · Apr 11, 2023

This is where human feedback is needed to verify openAI output and give reasons why the result is bad so it learns just like humans from mistakes

KerimF · Apr 11, 2023

D.A.(Tony)Stewart said:
This is where human feedback is needed to verify openAI output and give reasons why the result is bad so it learns just like humans from mistakes

So, I wonder how old this openAI is?
I mean I started learning from my mistakes since I perceived my special existence, as every human is special. This happened since age 14. Before this, I was usually guided by some adults (my father died when I was 9) who used to tell me what is supposed to be good or bad for me.
Till now, I am in the leaning stage

KerimF · Apr 14, 2023

I think it is time, as I promised, to post what I did best for dividing (by an AVR8 code) a 32-bit dividend by an 8-bit register, with rounding:
For N=0 or 1, please see notes at the end of the code.

Kerim

Code:

;=============================================================
; *R* Division of 32-bit dividend {A} by 8-bit divisor {N} ***
;     and a Kd table for the {N} = 2 to 255
;=============================================================

; {R} = {A}/{N} = r31,r30,r29,r28 / {N}
;     = r31,r30,r29,r28 * {Kd} /256/256/256/256 then rounding
; {A} dividend in r19,r18,r17,r16 [0 to 2^32-1]
; {N} constant divisor in r10 [N=0 to 255, see Kd32b2B]
; {R} result in r31,r30,r29,r28 = {A}/{N} rounded
; {Kd} = round(256*256*256*256/{N},0) , the division constant
; {Kr} = INT[ (1-{N})/2 ] , the rounding constant
; not used registers: r25:r24, r9:r2
; 114 [code] + 512 [table] = 626 words, *2 = 1254 bytes
; 131, 140 or 142 cycles = 124c|133c|135c code + 4c RET + 3c RCALL

D32var8:
    MOV   r26, r10              ;1abc
    CPI   r26, 1                ;1abc, Note_2
    BREQ  DO_same               ;1abc, Note_2

    TST   r26                   ;1abc, Note_1 and Note_2
    BREQ  DOoverF               ;1abc, Note_1 and Note_2

    CLR   r27                   ;1abc
    LSL   r26                   ;1abc
    ROL   r27                   ;1abc
    LSL   r26                   ;1abc
    ROL   r27                   ;1abc
; r27:r26 = r26 * 4 [table index]

    LDI   ZH, high(Kd32b2B*2)   ;1abc, address in bytes, 0x??00
    MOV   ZL, r26               ;1abc
    ADD   ZH, r27               ;1abc
    LPM   r20, Z+               ;3abc
    LPM   r21, Z+               ;3abc
    LPM   r22, Z+               ;3abc
    LPM   r23, Z+               ;3abc,  [25abc]
; r23:r20 = reg_Kd

; multiplicand in r19,r18,r17,r16 = {A}
; multiplier   in r23,r22,r21,r20 = {Kd}
; mul. result  in r31:r26, r14:r13
; valid result in r31,r30,r29,r28

    MUL   r16, r20              ;2abc
    MOV   r13, r0               ;1abc
    MOV   r14, r1               ;1abc
    MUL   r17, r21              ;2abc
    MOVW  r27:r26, r1:r0        ;1abc
    MUL   r18, r22              ;2abc
    MOVW  r29:r28, r1:r0        ;1abc
    MUL   r19, r23              ;2abc
    MOVW  r31:r30, r1:r0        ;1abc

    MUL   r17, r20              ;2abc
    ADD   r14, r0               ;1abc
    ADC   r26, r1               ;1abc
    MUL   r18, r20              ;2abc
    ADD   r26, r0               ;1abc
    ADC   r27, r1               ;1abc
    MUL   r19, r20              ;2abc
    ADD   r27, r0               ;1abc
    ADC   r28, r1               ;1abc

    MUL   r16, r21              ;2abc
    ADD   r14, r0               ;1abc
    ADC   r26, r1               ;1abc
    ADC   r27, r15              ;1abc
    ADC   r28, r15              ;1abc
    ADC   r29, r15              ;1abc
    MUL   r18, r21              ;2abc
    ADD   r27, r0               ;1abc
    ADC   r28, r1               ;1abc
    ADC   r29, r15              ;1abc
    MUL   r19, r21              ;2abc
    ADD   r28, r0               ;1abc
    ADC   r29, r1               ;1abc

    MUL   r16, r22              ;2abc
    ADD   r26, r0               ;1abc
    ADC   r27, r1               ;1abc
    ADC   r28, r15              ;1abc
    ADC   r29, r15              ;1abc
    ADC   r30, r15              ;1abc
    MUL   r17, r22              ;2abc
    ADD   r27, r0               ;1abc
    ADC   r28, r1               ;1abc
    ADC   r29, r15              ;1abc
    ADC   r30, r15              ;1abc
    MUL   r19, r22              ;2abc
    ADD   r29, r0               ;1abc
    ADC   r30, r1               ;1abc

    MUL   r16, r23              ;2abc
    ADD   r27, r0               ;1abc
    ADC   r28, r1               ;1abc
    ADC   r29, r15              ;1abc
    ADC   r30, r15              ;1abc
    ADC   r31, r15              ;1abc
    MUL   r17, r23              ;2abc
    ADD   r28, r0               ;1abc
    ADC   r29, r1               ;1abc
    ADC   r30, r15              ;1abc
    ADC   r31, r15              ;1abc
    MUL   r18, r23              ;2abc
    ADD   r29, r0               ;1abc
    ADC   r30, r1               ;1abc
    ADC   r31, r15              ;1abc, +77abc [102abc]
; {B} = r31:r28 = r19:r16 * r23:r20 /256/256/256
; {B} = {A}*{Kd}/256/256/256
; {R} = {B} or {B}+1

; for rounding
; multiplicand in r31,r28 = {B}
; multiplier   in r10 = {N}
; valid result in r23,r20
    MUL   r28, r10              ;2abc
    MOVW  r21:r20, r1:r0        ;1abc
    CLR   r22                   ;1abc
    CLR   r23                   ;1abc
    MUL   r29, r10              ;2abc
    ADD   r21, r0               ;1abc
    ADC   r22, r1               ;1abc
    MUL   r30, r10              ;2abc
    ADD   r22, r0               ;1abc
    ADC   r23, r1               ;1abc
    MUL   r31, r10              ;2abc
    ADD   r23, r0               ;1abc
; {C} = r23:r20 = {B}*{N} = r31:r28 * r10

; the following conditions were deduced empirically
; if( Carry_1=0, {R}={B}, if( Zero_2=1 OR Carry_2=0, {R}={B}, {R}={B}+1 ) )

    SUB   r20, r16              ;1abc
    SBC   r21, r17              ;1abc
    SBC   r22, r18              ;1abc
    SBC   r23, r19              ;1abc
; {D1} = r23:r20 = {C} - {A} = r23:r20 - r19:r16

    BRCC  DIV_ret               ;2a|1bc, +22a [124a]
; if Carry_1=0, {R}={B}

    MOV   r26, r10              ;1bc
    DEC   r26                   ;1bc
    LSR   r26                   ;1bc
    NEG   r26                   ;1bc
; r26 = reg_Kr

; r26 is negative always
    SUB   r20, r26              ;1bc
    SBCI  r21, 0xFF             ;1bc
    SBCI  r22, 0xFF             ;1bc
    SBCI  r23, 0xFF             ;1bc
; {D2} = r23:r20 = {D1} + {-Kr} = r23:r20 + {-Kr}

    BRPL  DIV_ret               ;2b|1c, +31b [134b]
; if {D1} positive, {R}={B}

    ADIW  r29:r28, 1            ;1c
    SBCI  r30, byte3(-1)        ;1c
    SBCI  r31, byte4(-1)        ;1c, +33c [135c]
; {R}={B}+1

DIV_ret:
    RET                         ;4
; {R} = r31:r28 = {B} or {B}+1 [ {A}/{N} rounded ]

DOoverF:                        ; Note 2
    SER   r28                   ; Note 2
    SER   r29                   ; Note 2
    SER   r30                   ; Note 2
    SER   r31                   ; Note 2
    RET                         ; Note 2
; {R} = r31:r28 = 0xFFFFFFFF

DO_same:                        ; Note 2
    MOVW  r29:r28, r17:r16      ; Note 2
    MOVW  r31:r30, r19:r19      ; Note 2
    RET                         ; Note 2
; {R} = r31:r28 = r19:r16

;====================================

    .org 0x??00

; read Kd_L, Kd_H, Kd_U then Kd_Q
;  x [usable index] = 1 < x < 256, 1024-word table
; 256 entries * 2 words = 1024 bytes [512 words]
Kd32b2B:
    .dw   0,0,0,0,0,32768,21845,21845
    .dw   0,16384,13107,13107,43691,10922,18725,9362
    .dw   0,8192,50972,7281,39322,6553,53620,5957
    .dw   21845,5461,15124,5041,9362,4681,4369,4369
    .dw   0,4096,3855,3855,58254,3640,17246,3449
    .dw   52429,3276,49932,3120,59578,2978,25645,2849
    .dw   43691,2730,28836,2621,40330,2520,16991,2427
    .dw   37449,2340,56497,2259,34953,2184,4228,2114
    .dw   0,2048,61564,1985,34696,1927,29959,1872
    .dw   29127,1820,15941,1771,41391,1724,26887,1680
    .dw   26214,1638,28772,1598,24966,1560,6096,1524
    .dw   29789,1489,23302,1456,45590,1424,25099,1394
    .dw   21845,1365,30762,1337,47186,1310,1285,1285
    .dw   20165,1260,34623,1236,41263,1213,36938,1191
    .dw   18725,1170,49439,1149,61016,1129,51096,1110
    .dw   17476,1092,23636,1074,2114,1057,16644,1040
    .dw   0,1024,16132,1008,63550,992,9781,978
    .dw   50116,963,52239,949,14980,936,2769,923
    .dw   14564,910,49376,897,40739,885,53303,873
    .dw   20696,862,7660,851,13443,840,37331,829
    .dw   13107,819,5664,809,14386,799,38690,789
    .dw   12483,780,771,771,3048,762,18832,753
    .dw   47663,744,23564,736,11651,728,11523,720
    .dw   22795,712,45100,704,12549,697,55878,689
    .dw   43691,682,41213,675,48149,668,64212,661
    .dw   23593,655,57101,648,33411,642,17816,636
    .dw   10082,630,9986,624,17311,618,31849,612
    .dw   53400,606,16234,601,51237,595,27159,590
    .dw   9362,585,63216,579,57488,574,57558,569
    .dw   63276,564,8962,560,25548,555,47362,550
    .dw   8738,546,40621,541,11818,537,53281,532
    .dw   33825,528,18874,524,8322,520,2064,516
    .dw   0,512,2032,508,8066,504,18010,500
    .dw   31775,496,49275,492,4891,489,29613,485
    .dw   57826,481,23918,478,58887,474,31589,471
    .dw   7490,468,52057,464,34153,461,19248,458
    .dw   7282,455,63728,451,57456,448,53945,445
    .dw   53137,442,54980,439,59419,436,868,434
    .dw   10348,431,22274,428,36598,425,53274,422
    .dw   6722,420,27968,417,51433,414,11541,412
    .dw   39322,409,3664,407,35600,404,4021,402
    .dw   39961,399,12313,397,52113,394,28255,392
    .dw   6242,390,51576,387,33154,385,16480,383
    .dw   1524,381,53793,378,42184,376,32206,374
    .dw   23831,372,17032,370,11782,368,8055,366
    .dw   5825,364,5069,362,5761,360,7879,358
    .dw   11398,356,16295,354,22550,352,30140,350
    .dw   39043,348,49239,346,60707,344,7892,343
    .dw   21845,341,37013,339,53375,337,5377,336
    .dw   24074,334,43912,332,64874,330,21406,329
    .dw   44564,327,3260,326,28550,324,54882,322
    .dw   16705,321,45076,319,8908,318,39258,316
    .dw   5041,315,37315,313,4993,312,39135,310
    .dw   8656,309,44614,307,15925,306,53648,304
    .dw   26700,303,604,302,40885,300,16459,299
    .dw   58387,297,35585,296,13580,295,57895,293
    .dw   37449,292,17768,291,64376,289,46193,288
    .dw   28744,287,12020,286,61547,284,46244,283
    .dw   31638,282,17720,281,4481,280,57449,278
    .dw   45542,277,34289,276,23681,275,13710,274
    .dw   4369,273,61185,271,53079,270,45579,269
    .dw   38677,268,32367,267,26641,266,21492,265
    .dw   16913,264,12897,263,9437,262,6527,261
    .dw   4161,260,2331,259,1032,258,257,257

/*
Note_1:
If N=0, the returned result [r31:r28] is 2^32-1 = 4294967295
4294967295 is the highest possible number to replace infinity.
But, this could be replaced by setting an overflow flag, for example.

Note_2:
In case reg_N [r10] won’t be 0 or 1 for sure,
"CPI r26, 1" and " TST r26" at the beginning of the routine could be removed,
with their the codes at "DOoverF:" and "DO_same:"
*/

Welcome to EDAboard.com

ASM02 - How Fast Dividing, with Rounding, a 32-Bit dividend by an 8-Bit Divisor Could be?

KerimF

Advanced Member level 5

KerimF

Advanced Member level 5

D.A.(Tony)Stewart

Advanced Member level 7

KerimF

Advanced Member level 5

FvM

Super Moderator

KerimF

Advanced Member level 5

D.A.(Tony)Stewart

Advanced Member level 7

KerimF

Advanced Member level 5

D.A.(Tony)Stewart

Advanced Member level 7

KerimF

Advanced Member level 5

D.A.(Tony)Stewart

Advanced Member level 7

KerimF

Advanced Member level 5

KerimF

Advanced Member level 5

Similar threads

Part and Inventory Search

Welcome to EDABoard.com

Sponsor

Connect with us

Online statistics

Forum statistics