ARM Community: How to efficiently sum 4 x 8bit integers with ARM or NEON

To shrink an image by 4, how to quickly sum 4 pixels

#1 shervin

Posted 17 September 2010 - 05:33 AM

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int and then summing them (and then shifting right to get the average).

It's for a Cortex-A8 (ARMv7-A), and the data can be aligned to 32 bytes or whatever I want.

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/

#2 isogen74

Posted 17 September 2010 - 02:37 PM

shervin, on Sep 17 2010, 06:33 AM, said:

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly.


The trick would be to abuse the (ARM or Thumb-2) USAD8 instruction, or its accumulating form USADA8, I think. It performs a "sum of absolute differences": it subtracts each byte in one word from the corresponding byte in another, and sums the absolute values of the 4 resulting byte differences. If you give the value to difference against as 0, then each difference _is_ the original byte value in your 4 byte vector, so the result is the sum of the 4 bytes.

MOV r0, #0              ; value to difference against
LDR r1, [r2], #4        ; load 4 pixels, post-increment the pointer
USAD8 r3, r1, r0        ; r3 = sum of the 4 bytes in r1
                        ; (USADA8 r3, r1, r0, r3 would add the sum to r3 instead)


P.S. given that you want to put a gap between the load of r1 and the use of it on most modern ARM cores, you may want to do something like:

MOV r0, #0
LDR r1, [r12], #4
LDR r2, [r12], #4
LDR r3, [r12], #4
LDR r4, [r12], #4
USAD8 r5, r1, r0
USAD8 r6, r2, r0
USAD8 r7, r3, r0
USAD8 r8, r4, r0
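
The sum-of-absolute-differences trick can be checked off-target with a small model. The following is a minimal Python sketch, assuming the architectural behaviour of USADA8 (four operands: destination, two sources and an accumulator) and of USAD8 (the same without the accumulator). The function name `usada8` and its parameter names are made up for this illustration.

```python
def usada8(rn: int, rm: int, ra: int = 0) -> int:
    """Model of USADA8 Rd, Rn, Rm, Ra: Ra + sum of |Rn.byte[i] - Rm.byte[i]|.

    With ra = 0 this is also what the non-accumulating USAD8 computes.
    """
    total = ra
    for shift in (0, 8, 16, 24):
        a = (rn >> shift) & 0xFF
        b = (rm >> shift) & 0xFF
        total += abs(a - b)
    return total & 0xFFFFFFFF


# Four pixels packed into one 32-bit word, as a word load would deliver them.
pixels = bytes([10, 20, 30, 250])
word = int.from_bytes(pixels, "little")

# Differencing against 0 leaves each byte unchanged, so the result is the sum.
assert usada8(word, 0) == sum(pixels)

# The accumulator operand lets consecutive words be summed into one total.
assert usada8(word, 0, ra=sum(pixels)) == 2 * sum(pixels)
```

The largest possible result for one word is 4 * 255 = 1020, so a 32-bit accumulator can absorb a very large number of words before overflow becomes a concern.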

This post has been edited by isogen74: 22 September 2010 - 01:46 PM

When optimizing software, consider that the quickest code to run is the bit you removed from the call path.

#3 sim

Posted 17 September 2010 - 07:56 PM

Assuming your four pixels are in a square, i.e.:

+---+---+
| A | B |
+---+---+ ---> (A+B+C+D)/4
| C | D |
+---+---+


Then the following [untested] Neon code might not be too far off optimal.

;; Average square of four pixels to single pixel.
;; Produces NxM pixel image from 2Nx2M pixel image.
;; Generates 16 output pixels per loop.
;; May over-read by up to 63 bytes.
;; May over-write by up to 15 bytes.

;; r0 = Input line start address
;; r1 = Input line width in bytes
;; r2 = Input line total size in bytes
;; r3 = Output line start address

quad	FUNC
;; Compute start of second line and end address
	ADD		r1,r1,r0 
	ADD		r2,r2,r1

1
;; Load 32 pixels from each of two rows
	VLD1.8		{Q0,Q1},[r0]!
	VLD1.8		{Q2,Q3},[r1]!

;; Sum neighbouring 8-bits in each row to 16-bits
	VPADDL.U8	Q0,Q0
	VPADDL.U8	Q1,Q1
	VPADDL.U8	Q2,Q2
	VPADDL.U8	Q3,Q3

;; Sum 16-bit values vertically 
	VADD.U16	Q0,Q0,Q2
	VADD.U16	Q1,Q1,Q3

;; Divide each sum of four pixels by 4 and cast to char
	VSHRN.U16	D0,Q0,#2
	VSHRN.U16	D1,Q1,#2

;; Store 16 pixels of resized image
	VST1.8		{Q0},[r3]!

;; Loop while the second-row pointer is before the end address (unsigned compare)
	CMP		r1,r2
	BLO		%b1

;; Return from function
	BX		lr
	ENDFUNC
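
To check the output of a routine like this, a scalar reference model is handy. The following is a minimal Python sketch that follows the same three steps (pairwise widening add, vertical add, shift right by 2 with narrowing). It assumes an image given as a list of rows with even width and height. The name `halve_2x2` is made up for this illustration.

```python
def halve_2x2(image: list[list[int]]) -> list[list[int]]:
    """Average each 2x2 block of an 8-bit greyscale image (even width and height)."""
    out = []
    for y in range(0, len(image), 2):
        top, bottom = image[y], image[y + 1]
        # VPADDL.U8: add horizontal neighbours, widening to 16 bits.
        top_pairs = [top[x] + top[x + 1] for x in range(0, len(top), 2)]
        bottom_pairs = [bottom[x] + bottom[x + 1] for x in range(0, len(bottom), 2)]
        # VADD.U16: add the two rows; the maximum is 4 * 255 = 1020.
        sums = [t + b for t, b in zip(top_pairs, bottom_pairs)]
        # VSHRN #2: divide by 4 (truncating) and narrow back to 8 bits.
        out.append([s >> 2 for s in sums])
    return out


image = [
    [10, 20, 255, 255],
    [30, 40, 255, 254],
]
assert halve_2x2(image) == [[25, 254]]
```

Note that the shift truncates, so a block summing to 1019 gives 254 rather than 255. A rounding narrow (VRSHRN) would be the alternative if round-to-nearest is wanted.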


hth
s.

This post has been edited by sim: 18 September 2010 - 08:28 AM


#4 shervin

Posted 21 September 2010 - 06:45 AM

Wow, thanks so much guys, that's exactly what I needed to know! I still haven't learnt enough of NEON to have used it, and I totally overlooked USADA8 for this operation, so until now I had just come up with something like this:

	LDR	r0, [r4]		// Load 4 pixels A:B:C:D from (x,y)
	LDR	r1, [r5]		// Load 4 pixels E:F:G:H from (x,y+1)
	UHADD8	r2, r0, r1		// Add pixels A:B:C:D with pixels E:F:G:H and divide each pixel by 2.
	UXTB	r3, r2			// Set r3 = (D+H)/2
	UXTAB	r3, r3, r2, ROR #8	// Set r3 = r3 + (C+G)/2
	UXTAB	r3, r3, r2, ROR #16	// Set r3 = r3 + (B+F)/2
	UXTAB	r3, r3, r2, ROR #24	// Set r3 = r3 + (A+E)/2
	// r3 is now (A+E)/2 + (B+F)/2 + (C+G)/2 + (D+H)/2
	// which is roughly (A+B+C+D + E+F+G+H) / 2 (each halving truncates)
	LSR	r3, r3, #2	// Set r3 = average of 8 pixels A to H
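
One thing worth knowing about the UHADD8 sequence above is that each per-byte halving truncates, so the final average can come out one lower than an exact average of the eight pixels. The following is a minimal Python sketch of that effect, assuming little-endian word loads. The function names are made up for this illustration.

```python
def uhadd8(a: int, b: int) -> int:
    """Model of UHADD8: per-byte (a + b) >> 1 on two packed 32-bit words."""
    result = 0
    for shift in (0, 8, 16, 24):
        half = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) >> 1
        result |= half << shift
    return result


def average_8_halved(row0: bytes, row1: bytes) -> int:
    """Follow the UHADD8 / UXTAB / LSR sequence for a 4x2 block of pixels."""
    halves = uhadd8(int.from_bytes(row0, "little"), int.from_bytes(row1, "little"))
    total = sum((halves >> shift) & 0xFF for shift in (0, 8, 16, 24))
    return total >> 2


def average_8_exact(row0: bytes, row1: bytes) -> int:
    """Truncating average of all eight pixels with no intermediate rounding."""
    return (sum(row0) + sum(row1)) >> 3


row0, row1 = bytes([1, 1, 1, 5]), bytes([0, 0, 0, 0])
assert average_8_exact(row0, row1) == 1   # 8 / 8
assert average_8_halved(row0, row1) == 0  # the halvings lose three half-units
```

For most images a difference of at most one grey level is unlikely to matter, but it explains small mismatches when comparing against a C reference.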


So obviously your 2 solutions are much better! I used to be an Assembly programmer about 10 years ago for Intel 16-bit and 32-bit CPUs, but I only just started learning ARM last week, and now I finally see why so many people used to say that ARM RISC is better than Intel CISC! NEON seems really powerful, and I'm amazed that Thumb-2 has a single instruction like USADA8 for this (as a 32-bit encoding)!

Glad to finally be part of the ARM community :-)

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/

#5 isogen74

Posted 22 September 2010 - 01:52 PM

shervin, on Sep 21 2010, 07:45 AM, said:

Wow thanks so much guys, that's exactly what I needed to know!
Glad to finally be part of the ARM community :-)


No probs; glad to be of help. And welcome =) A good question to ask too - I love answering assembler hacking questions :)

I've programmed assembler on a couple of register-based architectures (ARM and TI DSPs mainly), and I have to say whenever I look at writing x86 CISC assembler I really get put off by it (mostly I just find register-based architectures more intuitive). The more recent versions of the ARM architecture are really nice to write algorithms for; a mixture of ARM DSP and SIMD instructions, some of the newer ARM instructions in ARMv7 such as the wide constant loads, and of course NEON, make it really very flexible and a pleasure to write in =)

This post has been edited by isogen74: 22 September 2010 - 01:53 PM


#6 shervin

Posted 25 September 2010 - 12:38 PM

Hi guys,

After spending about a week optimising my 2 resizing functions, learning NEON and creating NEON versions, I have some timing results that you guys might be interested in. As part of my project I made separate functions to divide the number of pixels by 4 (half width and half height) or to divide the number of pixels by 16 (quarter width and quarter height), where the resizing is done by adding all the 2x2 or 4x4 pixel values and dividing by 4 or 16.

To shrink a 480x360 pixel greyscale image by 16 (quarter width and quarter height) on a Cortex-A8 (iPhone 3GS):

Resize function in OpenCV C library (GCC4.2 with -O3) takes about 38 msec
My optimised C code without intrinsics (GCC4.2 with -O3) takes about 27 msec
My hand-optimised ARMv7 Assembly code takes about 7.8 msec
My hand-optimised NEON Assembly code takes about 0.9 msec


So I'm very happy to see that the resizing function that was the bottleneck of my program is now about 42 times faster! Especially since my optimised C code is structured to do exactly the same thing as my ARM assembly code, but obviously the C compiler didn't agree!
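
The speed-up factors follow directly from the four timings listed above. The following is a minimal Python sketch of the arithmetic, using the OpenCV timing as the baseline. The dictionary labels are shorthand made up for this illustration.

```python
# Timings in milliseconds, taken from the list above.
timings_ms = {
    "OpenCV resize (C)": 38.0,
    "Optimised C": 27.0,
    "ARMv7 assembly": 7.8,
    "NEON assembly": 0.9,
}

baseline = timings_ms["OpenCV resize (C)"]
speedups = {name: baseline / ms for name, ms in timings_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x the speed of the OpenCV version")
```

On these numbers the NEON version also comes out at a little under 9 times the speed of the hand-written ARMv7 version (7.8 / 0.9).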

Cheers,
Shervin Emami.
http://www.shervinemami.co.cc/

#7 sim

Posted 26 September 2010 - 08:30 PM

Presumably your 4x4/16 inner loop for Neon is something close to:

;; consume 256 source image pixels
	VLD1.8	{Q0,Q1},[r1@128]!; load 32 from row 0
	VLD1.8	{Q4,Q5},[r2]!	; load 32 from row 1
	VLD1.8	{Q8,Q9},[r3@64]!; load 32 from row 2
	VLD1.8	{Q12,Q13},[r4]!	; load 32 from row 3
	VLD1.8	{Q2,Q3},[r1@128]!; load another 32 from row 0
	VLD1.8	{Q6,Q7},[r2]!	; load another 32 from row 1
	VLD1.8	{Q10,Q11},[r3@64]!; load another 32 from row 2
	VLD1.8	{Q14,Q15},[r4]!	; load another 32 from row 3

;; now at 256 8-bit values

	VPADDL.u8	Q0,Q0; 8 adds
	VPADDL.u8	Q1,Q1; 8 adds
	VPADDL.u8	Q2,Q2; 8 adds
	VPADDL.u8	Q3,Q3; 8 adds

	VPADAL.u8	Q0,Q4; 16 adds
	VPADAL.u8	Q1,Q5; 16 adds
	VPADAL.u8	Q2,Q6; 16 adds
	VPADAL.u8	Q3,Q7; 16 adds

	VPADAL.u8	Q0,Q8; 16 adds
	VPADAL.u8	Q1,Q9; 16 adds
	VPADAL.u8	Q2,Q10; 16 adds
	VPADAL.u8	Q3,Q11; 16 adds

	VPADAL.u8	Q0,Q12; 16 adds
	VPADAL.u8	Q1,Q13; 16 adds
	VPADAL.u8	Q2,Q14; 16 adds
	VPADAL.u8	Q3,Q15; 16 adds

;; now at 32 16-bit values

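;; VPADD only takes D registers, so add each half separately
	VPADD.i16	D0,D0,D1; 4 adds
	VPADD.i16	D1,D2,D3; 4 adds
	VPADD.i16	D2,D4,D5; 4 adds
	VPADD.i16	D3,D6,D7; 4 adds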

;; now at 16 16-bit values

	VSHRN.u16	D0,Q0,#4; 8 divides by 16
	VSHRN.u16	D1,Q1,#4; 8 divides by 16

;; now at 16 8-bit values

;; write out 16 destination image pixels
	VST1.8	{Q0},[r0@64]!; store 16


Pulling in 256 pixels (filling the entire Neon register file) and emitting 16 per iteration.
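
The same reduction can be modelled in scalar code to validate the NEON output. The following is a minimal Python sketch, assuming four rows of equal length (a multiple of 4) passed as lists. It mirrors the pairwise add, the three pairwise accumulates, the final pairwise add and the narrowing shift. The name `quarter_4x4` is made up for this illustration.

```python
def quarter_4x4(rows: list[list[int]]) -> list[int]:
    """Reduce four equal-length rows of 8-bit pixels to one row of 4x4 averages."""
    assert len(rows) == 4 and len(rows[0]) % 4 == 0
    width = len(rows[0])
    # VPADDL.U8 on row 0: pairwise sums widened to 16 bits.
    acc = [rows[0][x] + rows[0][x + 1] for x in range(0, width, 2)]
    # VPADAL.U8 for rows 1..3: pairwise sums accumulated into the same lanes.
    for row in rows[1:]:
        for i, x in enumerate(range(0, width, 2)):
            acc[i] += row[x] + row[x + 1]
    # Pairwise add of neighbouring 16-bit lanes: each value now covers 4x4 pixels.
    sums = [acc[i] + acc[i + 1] for i in range(0, len(acc), 2)]
    assert max(sums) <= 16 * 255  # 4080, so 16-bit lanes cannot overflow
    # VSHRN #4: divide by 16 (truncating) and narrow to 8 bits.
    return [s >> 4 for s in sums]


rows = [[0, 1, 2, 3, 100, 100, 100, 100]] * 4
assert quarter_4x4(rows) == [1, 100]
```

The overflow check makes the headroom explicit: sixteen pixels of 255 sum to 4080, which fits comfortably in a 16-bit lane.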

s.

#8 shervin

Posted 26 September 2010 - 11:59 PM

Actually I just used VPADDL's instead of VPADAL, so your code should run even faster than mine :-) But there is one other important difference: my code doesn't specify the data alignment (such as @64 and @128), because I'm using the default assembler in Xcode (GCC 4.2 with -x assembler-with-cpp), and I can't figure out how to specify the NEON data alignment. Maybe I should be using NASM or something instead of GCC to assemble my code...

And one question about your code: You specify @128 alignment for 2 of your instructions and @64 for the other 2 loads & store. The timing diagram says that @64 is the max alignment it can take advantage of in VLD1.8, so is there a reason you wrote @128 for some of your instructions and not others?

#9 sim

Posted 27 September 2010 - 07:37 AM

shervin, on Sep 27 2010, 12:59 AM, said:

I can't figure out how to specify the NEON data alignment


I believe the GNU assembler (which GCC uses) takes ":" rather than "@", as "@" is its comment character for ARM.

Quote

And one question about your code: You specify @128 alignment for 2 of your instructions and @64 for the other 2 loads & store. The timing diagram says that @64 is the max alignment it can take advantage of in VLD1.8, so is there a reason you wrote @128 for some of your instructions and not others?


I wasn't assuming any particular processor was in use; I simply provided the largest alignment that could be guaranteed for the given multiple of 480 bytes, assuming the source image started off 128-byte aligned.
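
The alignment each row start can guarantee is easy to work out mechanically. The following is a minimal Python sketch, assuming a 480-byte row stride and a first row aligned to 128 bytes (both taken from the discussion above). Bear in mind that the qualifier on VLD1/VST1 (@64, @128, @256) is expressed in bits, while the function below works in bytes. The name `guaranteed_alignment_bytes` is made up for this illustration, and `base_alignment` is assumed to be a power of two.

```python
def guaranteed_alignment_bytes(offset: int, base_alignment: int) -> int:
    """Largest power-of-two alignment (in bytes) guaranteed at base + offset."""
    if offset == 0:
        return base_alignment
    lowest_set_bit = offset & -offset
    return min(lowest_set_bit, base_alignment)


stride = 480          # bytes per source row, as in the 480-pixel-wide image
base_alignment = 128  # assumed alignment of the first row, in bytes

for row in range(4):
    align_bytes = guaranteed_alignment_bytes(row * stride, base_alignment)
    print(f"row {row}: {align_bytes}-byte ({align_bytes * 8}-bit) aligned")
```

Under these assumptions every row start is at least 32-byte (256-bit) aligned, since 480 is a multiple of 32. The same function can be applied to the per-iteration pointer increments to see what still holds inside the loop.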

hth
s.

#10 sim

Posted 27 September 2010 - 08:01 AM

If you're down to final tweaking, it might be worth experimenting with preloading ahead in the source image.
Something like:

;; now at 256 8-bit values

	...
	VPADDL.u8	Q3,Q3; 8 adds
	PLD		[r1,#((4*640)-256)]

	...
	VPADAL.u8	Q3,Q7; 16 adds
	PLD		[r2,#((4*640)-256)]

	...
	VPADAL.u8	Q3,Q11; 16 adds
	PLD		[r3,#((4*640)-256)]

	...
	VPADAL.u8	Q3,Q15; 16 adds
	PLD		[r4,#((4*640)-256)]

;; now at 32 16-bit values


s.

#11 isogen74

Posted 27 September 2010 - 09:27 AM

Quote

Especially since my optimised C code is structured to do the exact same thing as what my ARM assembly code does, but obviously the C compiler didn't agree!


Very few compilers generate "weird" instructions - so if you are after anything a little special in the instruction set the odds are you will either need to use intrinsics for that instruction or fall back to assembler.

#12 shervin

Posted 27 September 2010 - 11:23 AM

sim, on Sep 27 2010, 08:01 AM, said:

If you're down to final tweaking, it might be worth experimenting with preloading ahead in the source image.

It's funny, I was thinking of asking you about memory preloading, but I thought I had already asked too much of your time as it is :-) From the few message posts I've read about NEON optimisation (I think mainly in the FFmpeg msg boards), they say that memory preloading involves some trial & error to get the right values in the right places?

I tried aligning in GCC using:
VLD1.u8 {q0}, [r0:128]!
but it still gives an error, and I tried every keyboard symbol in place of @ but it still won't work. I'll try using NASM instead.
Anyway, I still don't understand why you aligned some to @128 and some to @64 and some to nothing. Wouldn't it work better if all 8 loads & the store use alignment (such as @64 on everything if it's a 480 pixel wide image or @128 if it's a 640 pixel wide image)?

Thanks a lot for your help! I'm still contemplating whether to attempt a generic image resizing function (from any size to any size) using NEON or whether it would be too difficult to take advantage of SIMD for that type of operation.

Cheers,
Shervin Emami.
http://www.shervinemami.co.cc/

#13 shervin

Posted 27 September 2010 - 12:35 PM

I just installed NASM only to discover that it doesn't support ARM. So I guess I'll stick with the Xcode default "gcc-4.2 -x assembler-with-cpp" and not have NEON alignment. And I finally figured out why you only gave alignment on some rows and not others: because you thought I had a 480 pixel wide image, and since 480 is not divisible by 64 or 128, only some rows would have good alignment.

Like I said, thanks a lot both of you for your help with jump starting me on ARM and NEON development.

#14 isogen74

Posted 27 September 2010 - 01:42 PM

shervin, on Sep 27 2010, 01:35 PM, said:

... and not have NEON alignment.


There are some known bugs with the early implementations of alignment annotations in GNU Assembler syntax, so depending on how old your toolchain is (4.2 is quite old, so I think it suffers from this bug) you may have to bodge your code to use the old syntax.

In summary - the buggy implementation needed an extra ',' between the register and the alignment.

@ Buggy form, which works on older GAS assembler
VLD1.8 {d0, d1}, [r1, :128]

@ Correct version which works in new GAS assembler (old form still supported though)
VLD1.8 {d0, d1}, [r1 :128]


See ...

http://www.listware.net/201006/gnu-binutil...acceptance.html

This post has been edited by isogen74: 27 September 2010 - 01:49 PM


#15 JerryClifford

Posted 02 October 2010 - 11:03 AM

thanks guys for this great info here. :)

#16 shervin

Posted 03 October 2010 - 09:52 AM

isogen74, on Sep 27 2010, 02:42 PM, said:

In summary - the buggy implementation needed an extra ',' between the register and the alignment.

Yes you are right, it works when I use:
			vld1.8 {d0, d1}, [r1, :128]

Thanks! I actually posted the issue on the gcc-help mailing list and got a reply from Richard Earnshaw at ARM saying that it is a bug in old versions of the assembler in binutils (not the gcc compiler), and that:

Quote

I've just realized that older binutils are buggy and don't parse this correctly. It will be fixed in the up-coming binutils 2.21 release, or you can download the latest sources from www.sourceware.org.


Now I'm ready to start making more optimized functions :-) This is my first time trying to write SIMD code, so I'm wondering, are there any websites or something that show tricks of the trade or useful advice for writing SIMD code by hand? Otherwise I'll just try to figure it out myself based on the ARM + NEON instruction set.

Cheers,
Shervin Emami.

#17 ThomasMarz

Posted 15 October 2010 - 07:59 AM

I've been trying to find this info for a long time now lol. Thanks.
