How to efficiently sum 4 x 8-bit integers with ARM or NEON
To shrink an image by 4, how to quickly sum 4 pixels
#1
Posted 17 September 2010 - 05:33 AM
I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.
But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int and then summing them (and then shifting right to get the average).
It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.
Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
#2
Posted 17 September 2010 - 02:37 PM
shervin, on Sep 17 2010, 06:33 AM, said:
The trick would be to abuse the (ARM or Thumb-2) USAD8 instruction, I think. It performs a "sum of absolute differences": it subtracts each byte in one word from the corresponding byte in another, and sums the absolute values of the 4 resulting byte differences. If you give the value to difference against as 0, then each difference _is_ the original value in your 4-byte vector, so the result is the sum of the 4 bytes. (USADA8 is the accumulating form: it takes a fourth register operand and adds the result to it.)
MOV r0, #0          ; value to difference against
LDR r1, [r2], #4    ; load 4 pixels, post-increment the pointer
USAD8 r3, r1, r0    ; r3 = sum of the 4 bytes in r1
P.S. Given that you want to put a gap between the load of r1 and the use of it on most modern ARM cores, you may want to do something like:
MOV r0, #0
LDR r1, [r12], #4
LDR r2, [r12], #4
LDR r3, [r12], #4
LDR r4, [r12], #4
USAD8 r5, r1, r0
USAD8 r6, r2, r0
USAD8 r7, r3, r0
USAD8 r8, r4, r0
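For anyone wanting to check results on a desktop first, here is a minimal C sketch of what the sum-of-absolute-differences trick above computes. It is a portable model only, not the instruction itself, and the names sad4_model and sum4_bytes are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Portable model of a byte-wise sum of absolute differences:
   |a0-b0| + |a1-b1| + |a2-b2| + |a3-b3| over the four bytes of each word.
   (Hypothetical helper name, for illustration only.) */
uint32_t sad4_model(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        int x = (int)((a >> (8 * i)) & 0xFFu);
        int y = (int)((b >> (8 * i)) & 0xFFu);
        sum += (uint32_t)abs(x - y);
    }
    return sum;
}

/* With the second operand fixed at zero, every |byte - 0| is the byte
   itself, so the result is simply the sum of the four bytes. */
uint32_t sum4_bytes(uint32_t pixels)
{
    uint32_t sum = sad4_model(pixels, 0u);
    assert(sum <= 4u * 255u); /* four 8-bit values need at most 10 bits */
    return sum;
}
```

Dividing the result by 4 (a shift right by 2) then gives the truncated average of the 4 pixels.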
This post has been edited by isogen74: 22 September 2010 - 01:46 PM
#3
Posted 17 September 2010 - 07:56 PM
+---+---+
| A | B |
+---+---+ ---> (A+B+C+D)/4
| C | D |
+---+---+
Then the following [untested] Neon code might not be too far off optimal.
;; Average square of four pixels to single pixel.
;; Produces NxM pixel image from 2Nx2M pixel image.
;; Generates 16 output pixels per loop.
;; May over-read by up to 63 bytes.
;; May over-write by up to 15 bytes.
;; r0 = Input line start address
;; r1 = Input line width in bytes
;; r2 = Input line total size in bytes
;; r3 = Output line start address
quad FUNC
;; Compute start of second line and end address
ADD r1,r1,r0
ADD r2,r2,r1
1
;; Load 32 pixels from each of two rows
VLD1.8 {Q0,Q1},[r0]!
VLD1.8 {Q2,Q3},[r1]!
;; Sum neighbouring 8-bits in each row to 16-bits
VPADDL.U8 Q0,Q0
VPADDL.U8 Q1,Q1
VPADDL.U8 Q2,Q2
VPADDL.U8 Q3,Q3
;; Sum 16-bit values vertically
VADD.U16 Q0,Q0,Q2
VADD.U16 Q1,Q1,Q3
;; Divide each sum of four pixels by 4 and cast to char
VSHRN.U16 D0,Q0,#2
VSHRN.U16 D1,Q1,#2
;; Store 16 pixels of resized image
VST1.8 {Q0},[r3]!
;; Loop if not past end of image
CMP r1,r2
BLS %b1 ;; unsigned compare, since r1 and r2 are addresses
;; Return from function
BX lr
ENDFUNC
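A scalar C reference is handy for checking the output of a routine like the one above. This is only a sketch under my own assumptions (tightly packed rows, even width and height); shrink2x2_ref and its parameters are hypothetical names, not part of the posted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for a 2x2 box shrink: each output pixel is
   (A + B + C + D) / 4, truncated, like a narrowing shift right by 2.
   Assumes rows are tightly packed (stride == src_width). */
void shrink2x2_ref(const uint8_t *src, size_t src_width, size_t src_height,
                   uint8_t *dst)
{
    assert(src_width % 2 == 0 && src_height % 2 == 0);
    size_t dst_width = src_width / 2;
    for (size_t y = 0; y < src_height / 2; y++) {
        const uint8_t *row0 = src + (2 * y) * src_width;
        const uint8_t *row1 = row0 + src_width;
        for (size_t x = 0; x < dst_width; x++) {
            /* Horizontal pair sums fit in 16 bits (max 510 each). */
            uint16_t top = (uint16_t)(row0[2 * x] + row0[2 * x + 1]);
            uint16_t bottom = (uint16_t)(row1[2 * x] + row1[2 * x + 1]);
            /* Vertical add (max 1020), then divide by 4 and narrow. */
            dst[y * dst_width + x] = (uint8_t)((top + bottom) >> 2);
        }
    }
}
```

The intermediate sums are kept in 16 bits on purpose, to mirror the widen, add, narrow sequence and to show that nothing can overflow.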
hth
s.
This post has been edited by sim: 18 September 2010 - 08:28 AM
#4
Posted 21 September 2010 - 06:45 AM
LDR r0, [r4] // Load 4 pixels A:B:C:D from (x,y)
LDR r1, [r5] // Load 4 pixels E:F:G:H from (x,y+1)
UHADD8 r2, r0, r1 // Add pixels A:B:C:D to pixels E:F:G:H and divide each byte sum by 2.
UXTB r3, r2 // Set r3 = (D+H)/2
UXTAB r3, r3, r2, ROR #8 // Set r3 = r3 + (C+G)/2
UXTAB r3, r3, r2, ROR #16 // Set r3 = r3 + (B+F)/2
UXTAB r3, r3, r2, ROR #24 // Set r3 = r3 + (A+E)/2
// r3 is now (A+E)/2 + (B+F)/2 + (C+G)/2 + (D+H)/2
// which is (A+B+C+D + E+F+G+H) / 2 (ignoring the bit each halving add may drop)
LSR r3, r3, #2 // Set r3 = average of 8 pixels A to H
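One thing worth knowing about the halving-add approach above is that each per-byte halving drops its low bit, so the final average can come out 1 lower than adding all 8 pixels first and dividing once. The C sketch below models both routes so the difference can be seen; the function names are invented for illustration and this is a model, not the instructions themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a per-byte unsigned halving add: each byte is (a + b) >> 1. */
uint32_t uhadd8_model(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        r |= ((x + y) >> 1) << (8 * i);
    }
    return r;
}

/* Average of 8 pixels via halving add, byte sum, then shift right by 2. */
uint32_t avg8_halving(uint32_t row0, uint32_t row1)
{
    uint32_t h = uhadd8_model(row0, row1);
    uint32_t sum = (h & 0xFFu) + ((h >> 8) & 0xFFu)
                 + ((h >> 16) & 0xFFu) + (h >> 24);
    assert(sum <= 4u * 255u); /* four halved bytes */
    return sum >> 2;
}

/* Exact truncated average: add all 8 bytes first, divide once. */
uint32_t avg8_exact(uint32_t row0, uint32_t row1)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += (row0 >> (8 * i)) & 0xFFu;
        sum += (row1 >> (8 * i)) & 0xFFu;
    }
    return sum >> 3;
}
```

For example, with pixel bytes 1, 1, 3, 3 in one row and zeros in the other, the exact route gives 1 while the halving route gives 0. For image shrinking that error is usually acceptable, but it explains small mismatches against a C reference.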
So obviously your 2 solutions are much better! I used to be an Assembly programmer about 10 years ago for Intel 16-bit and 32-bit CPUs but I only just started learning ARM last week, and now I finally see why so many people used to say that ARM RISC is better than Intel CISC! NEON seems really powerful, and I'm amazed that Thumb-2 can fit something like USADA8 into a single instruction!
Glad to finally be part of the ARM community :-)
Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
#5
Posted 22 September 2010 - 01:52 PM
shervin, on Sep 21 2010, 07:45 AM, said:
Glad to finally be part of the ARM community :-)
No probs; glad to be of help. And welcome =) A good question to ask too - I love answering assembler hacking questions.
I've programmed assembler on a couple of register based architectures (ARM and TI DSPs mainly), and I have to say whenever I look at writing x86 CISC assembler I really get put off by it (mostly I just find register based architectures more intuitive). The more recent versions of the ARM architecture are really nice to write algorithms for; a mixture of ARM DSP and SIMD instructions, some of the newer ARM instructions in ARMv7 such as the wide constant loads, and of course NEON, make it really very flexible and a pleasure to write in =)
This post has been edited by isogen74: 22 September 2010 - 01:53 PM
#6
Posted 25 September 2010 - 12:38 PM
After spending about a week optimising my 2 resizing functions, learning NEON and creating NEON versions, I have some timing results that you guys might be interested in. As part of my project I made separate functions to divide the number of pixels by 4 (half width and half height) or to divide the number of pixels by 16 (quarter width and quarter height), where the resizing is done by adding all the 2x2 or 4x4 pixel values and dividing by 4 or 16.
To shrink a 480x360 pixel greyscale image by 16 (quarter width and quarter height) on a Cortex-A8 (iPhone 3GS):
My optimised C code without intrinsics (GCC4.2 with -O3) takes about 27 msec
My hand-optimised ARMv7 Assembly code takes about 7.8 msec
My hand-optimised NEON Assembly code takes about 0.9 msec
So I'm very happy to see the resizing function that was the bottleneck of my program is now about 30 times faster than the C version (27 msec down to 0.9 msec)! Especially since my optimised C code is structured to do the exact same thing as what my ARM assembly code does, but obviously the C compiler didn't agree!
Cheers,
Shervin Emami.
http://www.shervinemami.co.cc/
#7
Posted 26 September 2010 - 08:30 PM
;; consume 256 source image pixels
VLD1.8 {Q0,Q1},[r1@128]!; load 32 from row 0
VLD1.8 {Q4,Q5},[r2]! ; load 32 from row 1
VLD1.8 {Q8,Q9},[r3@64]!; load 32 from row 2
VLD1.8 {Q12,Q13},[r4]! ; load 32 from row 3
VLD1.8 {Q2,Q3},[r1@128]!; load another 32 from row 0
VLD1.8 {Q6,Q7},[r2]! ; load another 32 from row 1
VLD1.8 {Q10,Q11},[r3@64]!; load another 32 from row 2
VLD1.8 {Q14,Q15},[r4]! ; load another 32 from row 3
;; now at 256 8-bit values
VPADDL.u8 Q0,Q0; 8 adds
VPADDL.u8 Q1,Q1; 8 adds
VPADDL.u8 Q2,Q2; 8 adds
VPADDL.u8 Q3,Q3; 8 adds
VPADAL.u8 Q0,Q4; 16 adds
VPADAL.u8 Q1,Q5; 16 adds
VPADAL.u8 Q2,Q6; 16 adds
VPADAL.u8 Q3,Q7; 16 adds
VPADAL.u8 Q0,Q8; 16 adds
VPADAL.u8 Q1,Q9; 16 adds
VPADAL.u8 Q2,Q10; 16 adds
VPADAL.u8 Q3,Q11; 16 adds
VPADAL.u8 Q0,Q12; 16 adds
VPADAL.u8 Q1,Q13; 16 adds
VPADAL.u8 Q2,Q14; 16 adds
VPADAL.u8 Q3,Q15; 16 adds
;; now at 32 16-bit values
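;; (VPADD only operates on D registers, so work on the halves of Q0-Q3)
VPADD.u16 D0,D0,D1; 4 adds
VPADD.u16 D1,D2,D3; 4 adds
VPADD.u16 D2,D4,D5; 4 adds
VPADD.u16 D3,D6,D7; 4 adds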
;; now at 16 16-bit values
VSHRN.u16 D0,Q0,#4; 8 divides by 16
VSHRN.u16 D1,Q1,#4; 8 divides by 16
;; now at 16 8-bit values
;; write out 16 destination image pixels
VST1.8 {Q0},[r0@64]!; store 16
Pulling in 256 pixels (filling the entire Neon register file) and emitting 16 per iteration.
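To sanity-check a quarter-size routine like this one, a scalar C model that follows the same stages (pairwise sums accumulated over 4 rows, a second pairwise add, then a shift right by 4) can be useful. This is a sketch under my own assumptions (tightly packed rows, dimensions divisible by 4), and shrink4x4_ref is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for a 4x4 box shrink (quarter width, quarter height).
   Assumes rows are tightly packed (stride == src_width). */
void shrink4x4_ref(const uint8_t *src, size_t src_width, size_t src_height,
                   uint8_t *dst)
{
    assert(src_width % 4 == 0 && src_height % 4 == 0);
    size_t dst_width = src_width / 4;
    for (size_t y = 0; y < src_height / 4; y++) {
        for (size_t x = 0; x < dst_width; x++) {
            /* Two 16-bit accumulators, one per horizontal pixel pair. */
            uint16_t pair0 = 0, pair1 = 0;
            for (size_t r = 0; r < 4; r++) {
                const uint8_t *p = src + (4 * y + r) * src_width + 4 * x;
                pair0 = (uint16_t)(pair0 + p[0] + p[1]); /* max 4 * 510 */
                pair1 = (uint16_t)(pair1 + p[2] + p[3]);
            }
            /* Combine the pairs (max 16 * 255 = 4080), then divide by 16. */
            uint16_t sum = (uint16_t)(pair0 + pair1);
            dst[y * dst_width + x] = (uint8_t)(sum >> 4);
        }
    }
}
```

The comments on the maximum values show why 16-bit lanes are enough: 16 pixels of 255 sum to 4080, far below 65535.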
s.
#8
Posted 26 September 2010 - 11:59 PM
And one question about your code: You specify @128 alignment for 2 of your instructions and @64 for the other 2 loads & store. The timing diagram says that @64 is the max alignment it can take advantage of in VLD1.8, so is there a reason you wrote @128 for some of your instructions and not others?
#9
Posted 27 September 2010 - 07:37 AM
shervin, on Sep 27 2010, 12:59 AM, said:
I believe GCC uses ":" rather than "@", as "@" is the GCC comment character.
I wasn't assuming any particular processor was in use; I simply provided the largest alignment that could be guaranteed for the given multiple of 480 bytes, assuming the source image started off 128-byte aligned.
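The reasoning about which alignment can be guaranteed for each row can be worked out mechanically. Below is a small C sketch, assuming tightly packed rows so that row k starts at base + k * stride; guaranteed_alignment is a made-up helper name. Note that the helper works in bytes, whereas the alignment qualifier on a NEON load is given in bits.

```c
#include <assert.h>
#include <stddef.h>

/* Largest power-of-two alignment (in bytes) guaranteed for base + offset,
   given only that base is aligned to base_align bytes (a power of two).
   (Hypothetical helper, for illustration only.) */
size_t guaranteed_alignment(size_t base_align, size_t offset)
{
    assert(base_align != 0 && (base_align & (base_align - 1)) == 0);
    if (offset == 0)
        return base_align;
    /* Lowest set bit of the offset = largest power of two dividing it. */
    size_t offset_align = offset & (~offset + 1);
    return offset_align < base_align ? offset_align : base_align;
}
```

With a 480-byte stride and a 128-byte aligned base, this gives 32, 64 and 32 bytes for rows 1, 2 and 3; with a 640-byte stride every row stays 128-byte aligned.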
hth
s.
#10
Posted 27 September 2010 - 08:01 AM
Something like:
;; now at 256 8-bit values
...
VPADDL.u8 Q3,Q3; 8 adds
PLD [r1,#((4*640)-256)]
...
VPADAL.u8 Q3,Q7; 16 adds
PLD [r2,#((4*640)-256)]
...
VPADAL.u8 Q3,Q11; 16 adds
PLD [r3,#((4*640)-256)]
...
VPADAL.u8 Q3,Q15; 16 adds
PLD [r4,#((4*640)-256)]
;; now at 32 16-bit values
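The same preloading idea can be tried from C before committing to assembly. The sketch below uses __builtin_prefetch, which is a GCC/Clang builtin rather than standard C; the look-ahead distance and the 64-byte chunk size are assumed values that would need tuning by trial and error on the real hardware, and sum_with_prefetch is just an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed tuning value (bytes of look-ahead), not a recommendation. */
#define PREFETCH_AHEAD 256

/* Sums a byte buffer while hinting the cache about data needed later. */
uint32_t sum_with_prefetch(const uint8_t *src, size_t len)
{
    assert(src != NULL || len == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        /* One hint per 64-byte chunk (assumed chunk size), kept inside
           the buffer. The prefetch changes timing only, never results. */
        if (i % 64 == 0 && i + PREFETCH_AHEAD < len)
            __builtin_prefetch(src + i + PREFETCH_AHEAD);
        sum += src[i];
    }
    return sum;
}
```

Since a prefetch never changes the computed values, the only way to judge a given distance is to time the loop with and without it.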
s.
#11
Posted 27 September 2010 - 09:27 AM
Very few compilers generate "weird" instructions - so if you are after anything a little special in the instruction set the odds are you will either need to use intrinsics for that instruction or fall back to assembler.
#12
Posted 27 September 2010 - 11:23 AM
sim, on Sep 27 2010, 08:01 AM, said:
It's funny, I was thinking of asking you about memory preloading but I thought I had already asked too much of your time as it is :-) From the few message posts I've read about NEON optimisation (I think mainly in the FFmpeg msg boards), they say that memory preloading involves some trial & error to get the right values in the right places?
I tried aligning in GCC using:
VLD1.u8 {q0}, [r0:128]!
but it still gives an error, and I tried every keyboard symbol in place of @ but it still won't work. I'll try using NASM instead.
Anyway I still don't understand why you aligned some to @128 and some to @64 and some to nothing. Wouldn't it work better if all 8 loads & the store use align (such as @64 on everything if it's a 480 pixel wide image or @128 if it's a 640 pixel wide image)?
Thanks a lot for your help! I'm still contemplating whether to attempt a generic image resizing function (from any size to any size) using NEON or whether it would be too difficult to take advantage of SIMD for that type of operation.
Cheers,
Shervin Emami.
http://www.shervinemami.co.cc/
#13
Posted 27 September 2010 - 12:35 PM
Like I said, thanks a lot both of you for your help with jump starting me on ARM and NEON development.
#14
Posted 27 September 2010 - 01:42 PM
shervin, on Sep 27 2010, 01:35 PM, said:
There are some known bugs in the early implementations of alignment annotations in GNU assembler syntax, so depending on how old your toolchain is (4.2 is quite old, so I think it suffers from this bug) you may have to bodge your code to use the old syntax.
In summary: the buggy implementation needed an extra ',' between the register and the alignment.
@ Buggy form, which works on older GAS assembler
VLD1.8 {d0}, [r1, :128]
@ Correct version which works in new GAS assembler (old form still supported though)
VLD1.8 {d0}, [r1 :128]
See ...
http://www.listware.net/201006/gnu-binutil...acceptance.html
This post has been edited by isogen74: 27 September 2010 - 01:49 PM
#16
Posted 03 October 2010 - 09:52 AM
isogen74, on Sep 27 2010, 02:42 PM, said:
Yes you are right, it works when I use:
vld1.8 {d0}, [r1, :128]
Thanks! I actually posted the issue on the gcc-help mailing list and got a reply from Richard Earnshaw at ARM saying that it is a bug in old versions of the assembler in binutils (not the gcc compiler), and that:
Quote
Now I'm ready to start making more optimized functions :-) This is my first time trying to write SIMD code, so I'm wondering, are there any websites or similar resources that show tricks of the trade or useful advice for writing SIMD code by hand? Otherwise I'll just try to figure it out myself based on the ARM + NEON instruction set.
Cheers,
Shervin Emami.
#17
Posted 15 October 2010 - 07:59 AM