ARM Community: NEON: fast 128 bit comparison - ARM Community

NEON: fast 128 bit comparison

#1 Mircea

Posted 30 January 2012 - 06:41 PM

I'm interested in finding the fastest way (lowest cycle count) of comparing the values stored in two NEON registers (say Q0 and Q3) on a Cortex-A9 core (VFP instructions allowed).

So far I have the following:

(1) Using the VFP floating point comparison:


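vcmp.f64        d0, d6              @ compare the low 64 bits of Q0 and Q3
vmrs            APSR_nzcv, fpscr    @ copy the VFP flags to the ARM flags
vcmpeq.f64      d1, d7              @ only if equal so far: compare the high 64 bits
vmrseq          APSR_nzcv, fpscr

If either 64-bit half happens to hold the bit pattern of a NaN, this version will not work, because a NaN never compares equal, not even to itself.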
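The NaN problem can be reproduced in portable C. The sketch below is only a model of what `vcmp.f64` sees when it is handed raw integer data; the function names are made up for illustration. It also shows a second mismatch between floating-point and bitwise equality: +0.0 and -0.0 compare equal although their bit patterns differ.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double, the way a VFP compare would. */
static double as_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Floating-point equality applied to raw bit patterns (what version (1) does). */
bool fp_equal(uint64_t a, uint64_t b)
{
    return as_double(a) == as_double(b);
}

/* Exact bitwise equality, which is what the comparison is meant to compute. */
bool bit_equal(uint64_t a, uint64_t b)
{
    return a == b;
}
```

For example, `fp_equal` is false for two copies of the quiet-NaN pattern 0x7FF8000000000000, while `bit_equal` is true.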

(2) Using the NEON narrowing and the VFP comparison (this time only once and in a NaN-safe manner):


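vceq.i32        q15, q0, q3         @ each 32-bit lane: all ones if equal, else zero
vmovn.i32       d31, q15            @ narrow to four 16-bit lanes
vshl.s16        d31, d31, #8        @ 0xffff becomes 0xff00, so d31 is never a NaN pattern
vcmp.f64        d31, d29
vmrs            APSR_nzcv, fpscr

The D29 register is preloaded with the matching 16-bit pattern:

vmov.i16        d29, #65280         @ 0xff00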
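Why the shift by 8 makes the comparison NaN-safe can be checked with a small C model. This is a sketch with hypothetical function names, not code for the target: it rebuilds the 64-bit value that ends up in D31 and tests whether a pattern would be a NaN when read as an IEEE 754 double (exponent field all ones, non-zero fraction).

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the value left in d31: one 16-bit lane per 32-bit word,
   0xFF00 where the words matched and 0x0000 where they differed. */
uint64_t narrowed_mask(const uint32_t a[4], const uint32_t b[4])
{
    uint64_t d31 = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t eq = (a[lane] == b[lane]) ? 0xFFFFFFFFu : 0u; /* vceq.i32    */
        uint16_t narrow = (uint16_t)eq;                         /* vmovn.i32   */
        uint16_t shifted = (uint16_t)(narrow << 8);             /* vshl.s16 #8 */
        d31 |= (uint64_t)shifted << (16 * lane);
    }
    return d31;
}

/* True when the pattern is a NaN if interpreted as a double. */
bool is_nan_pattern(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7FFu;
    uint64_t fraction = bits & 0x000FFFFFFFFFFFFFULL;
    return exponent == 0x7FFu && fraction != 0;
}
```

In this model the all-equal result 0xFF00FF00FF00FF00 has exponent field 0x7F0, so it is an ordinary number, whereas the unshifted all-ones mask would be a NaN.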

My question is: is there anything better than this? Am I overlooking some obvious way to do it?

This post has been edited by Mircea: 30 January 2012 - 06:54 PM


#2 Exophase

Posted 30 January 2012 - 07:22 PM

Could you describe in more detail what you're going to use the comparison for? Whichever way you do it, there will be a performance penalty for using NEON to influence control flow, even on Cortex-A9. And you probably don't gain an awful lot by performing the comparison in VFP and doing a vmrs instead of moving a compare mask to ARM registers and doing the tests there.

If at all possible you should keep things in NEON and perform masking selects instead of control flow. If you can't do this then you should re-evaluate if NEON makes much sense here.
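A masking select replaces a branch with bitwise arithmetic on a compare mask. The scalar C sketch below models the idea on a single 32-bit lane; the function names are hypothetical, and on NEON the same pattern would typically be a compare such as `vceq` followed by a bitwise select such as `vbsl`.

```c
#include <stdint.h>

/* All ones where the lanes are equal, all zeros otherwise (like vceq.i32). */
uint32_t eq_mask(uint32_t a, uint32_t b)
{
    return (a == b) ? 0xFFFFFFFFu : 0u;
}

/* Take bits from if_set where the mask is 1 and from if_clear where it is 0. */
uint32_t select_bits(uint32_t mask, uint32_t if_set, uint32_t if_clear)
{
    return (mask & if_set) | (~mask & if_clear);
}
```

For instance, `select_bits(eq_mask(x, k), on_match, on_mismatch)` picks a value without any control flow depending on the comparison.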

#3 webshaker

Posted 31 January 2012 - 08:48 AM

If you are interested in performance,
you can forget instructions like

vmrs

or any other instruction that transfers data from a NEON register to an ARM register.

In fact, you have 2 options.
- If you only have a single comparison to do, you should consider doing it on the ARM side and not in NEON.
- If you have a lot of comparisons to do (in a loop, for example), then use a memory buffer to store the comparison results. In this case, you perform 8 or 16 (or more) 128-bit comparisons in NEON and then do the ARM part on the stored results. This is the best way to keep NEON and the ARM core working in parallel.

When you use VMRS, the ARM core has to wait for NEON to finish its computation.
Because NEON has an instruction queue, it can take 10, 20 or more cycles just for NEON to reach your comparison instruction.
After that you have to wait extra cycles for the data to be transferred from the NEON register to the ARM register.

So the rule is:
never transfer a NEON register to an ARM register.


PS: this is off-topic, but the reverse operation is fine. Transferring an ARM register to a NEON register is very fast because the content of the ARM register is copied into the instruction queue.

This post has been edited by webshaker: 31 January 2012 - 08:53 AM

When you have eliminated the impossible, whatever remains, however improbable, must be the truth

#4 Mircea

Posted 31 January 2012 - 09:39 AM

The scenario is the following: I have a 128-bit constant K and I want to compare it with the 128-bit "variables" A1, A2... An in the following manner (with the loop unrolled):

for i = 1, n do
  if A[i] != K then
    break
  end
done
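As a plain scalar reference for the loop above, a C version might look like the sketch below. The `u128` type and the function name are assumptions made for illustration, and indices are zero-based here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit container: four 32-bit words, no padding. */
typedef struct { uint32_t w[4]; } u128;

/* Return the index of the first element that differs from k,
   or n if every element equals k. */
size_t first_mismatch(const u128 *a, size_t n, const u128 *k)
{
    for (size_t i = 0; i < n; i++) {
        if (memcmp(&a[i], k, sizeof *k) != 0)
            return i;   /* corresponds to the "break" */
    }
    return n;
}
```

Such a reference is mainly useful for checking the results of a NEON or VFP version.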



So... should I lock down two cache lines and use them to communicate between VFP/NEON and ARM?

Do the same performance penalties (regarding the pipelines) apply when using 64 bits-at-a-time comparisons (VFP only)?

This post has been edited by Mircea: 31 January 2012 - 09:40 AM


#5 webshaker

Posted 31 January 2012 - 10:59 AM

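OK,
let's suppose
r0 = pointer to A
r1 = temporary buffer for storing the comparison results
q10 = the constant K
n (the number of iterations) is a multiple of 8.


mov              r12, #n                             @ comparisons left for NEON
mov              r2, r1                              @ ARM read pointer into the result buffer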



First part: you begin by filling the NEON queue with 8 comparisons.



mov              r3, #4
.neon_loop_start_fill:
vld1.32          {q0, q1}, [r0]!                     @ load two 128-bit values
vceq.i32         q14, q0, q10                        @ each 32-bit lane all ones if equal
vceq.i32         q15, q1, q10                        @ each 32-bit lane all ones if equal
vmovn.i32        d10, q14                            @ reduce to 64 bits, keeping a part of every lane
vmovn.i32        d11, q15                            @ reduce to 64 bits, keeping a part of every lane
vmovn.i16        d0, q5                              @ reduce to 2 * 32 bits, one word per value
vst1.32          {d0}, [r1]!
subs             r3, r3, #1
bne              .neon_loop_start_fill

sub              r12, r12, #8                        @ 8 comparisons already computed (assumes n > 8)

This code fills the NEON queue so that NEON always has something to do while the ARM core works.

Now, the main loop.
In this loop, you compute 2 comparisons per iteration.


.main_loop:
vld1.32          {q0, q1}, [r0]!                     @ load two 128-bit values
vceq.i32         q14, q0, q10                        @ each 32-bit lane all ones if equal
vceq.i32         q15, q1, q10                        @ each 32-bit lane all ones if equal
vmovn.i32        d10, q14                            @ reduce to 64 bits, keeping a part of every lane
vmovn.i32        d11, q15                            @ reduce to 64 bits, keeping a part of every lane
vmovn.i16        d0, q5                              @ reduce to 2 * 32 bits, one word per value
vst1.32          {d0}, [r1]!

ldr              r4, [r2], #4                        @ read an earlier result word
cmp              r4, #-1                             @ all ones means A[i] == K
bne              .break

ldr              r4, [r2], #4
cmp              r4, #-1
bne              .break

subs             r12, r12, #2
bne              .main_loop


NEON writes through r1, while the ARM core reads earlier comparison results through r2!
This way there is no conflict: NEON and the ARM core can work in parallel.

At the end, when NEON has finished computing, the ARM core still has 8 comparison results to handle.

mov              r3, #4
.arm_loop_end:
ldr              r4, [r2], #4                        @ drain the remaining result words
cmp              r4, #-1
bne              .break

ldr              r4, [r2], #4
cmp              r4, #-1
bne              .break

subs             r3, r3, #1
bne              .arm_loop_end


I haven't tested this code (so it will probably not work as is).
The most important thing is to understand the process: in this code, the ARM core never needs to wait for NEON.
The performance clearly depends on n, the number of iterations.
I guess that if n < 16 this code will probably not be efficient.
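The structure of the scheme (a prologue that computes results ahead, a steady state where one side writes new results while the other reads older ones, and an epilogue that drains the buffer) can be sketched in scalar C. This is only a single-threaded model with hypothetical names: `compare_word` stands in for the NEON part, `LEAD` mirrors the 8 results computed up front, and the model handles one value per step where the assembly handles two.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LEAD = 8 };  /* how many results the producer stays ahead by */

typedef struct { uint32_t w[4]; } u128;  /* hypothetical 128-bit container */

/* Stand-in for the NEON part: one result word per 128-bit value. */
static uint32_t compare_word(const u128 *a, const u128 *k)
{
    return memcmp(a, k, sizeof *k) == 0 ? 0xFFFFFFFFu : 0u;
}

/* Returns the index of the first A[i] != K, or n if every value matches.
   `results` must have room for n words. */
size_t first_mismatch_buffered(const u128 *a, size_t n, const u128 *k,
                               uint32_t *results)
{
    size_t produced = 0;  /* write position (r1 in the assembly) */
    size_t consumed = 0;  /* read position (r2 in the assembly) */

    /* Prologue: compute the first LEAD results without reading any. */
    while (produced < n && produced < LEAD) {
        results[produced] = compare_word(&a[produced], k);
        produced++;
    }
    /* Steady state: write a new result, then test an older one. */
    while (produced < n) {
        results[produced] = compare_word(&a[produced], k);
        produced++;
        if (results[consumed] != 0xFFFFFFFFu)
            return consumed;
        consumed++;
    }
    /* Epilogue: the producer is done, drain what is left. */
    while (consumed < n) {
        if (results[consumed] != 0xFFFFFFFFu)
            return consumed;
        consumed++;
    }
    return n;
}
```

Note that a mismatch is only noticed after up to `LEAD` further comparisons have already been issued, which is the price of never waiting for the producer.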

Etienne

This post has been edited by webshaker: 31 January 2012 - 11:00 PM


#6 Mircea

Posted 07 February 2012 - 06:19 PM

Thanks a lot for the detailed solution. I do, however, have a number of questions:

  • Wouldn't it be better to "unroll" the "pre-filling loop" in order to avoid the branch mispredictions?
  • Why are there exactly FOUR iterations (comparison pairs) in the "pre-filling loop"?
  • Why are there exactly TWO 128-bit comparisons in one iteration? Is this because the A9 is dual-issue?
  • How does the memory disambiguation mechanism impact the performance in your solution? Doesn't it stall the ARM pipeline?
  • Why is the transfer from VFP registers to ARM registers (more precisely the VMRS instruction) so frowned upon? The "Cortex A9 MPE TRM" (section 3.4.10) states that a transfer from a VFP register to an "integer core" register has a latency of only 3 cycles. How does the pipeline impact of VMRS compare to the impact of the memory disambiguation mechanism?

Thanks!

#7 webshaker

Posted 08 February 2012 - 02:04 PM

Quote (Mircea @ 07 February 2012 - 06:19 PM)

Wouldn't it be better to "unroll" the "pre-filling loop" in order to avoid the branch mispredictions?

On modern processors, branch prediction is so good that unrolling the loop is (most of the time) not useful.
You may win some cycles, but this is not the most interesting optimisation.

Quote (Mircea @ 07 February 2012 - 06:19 PM)

Why are there exactly FOUR iterations (comparison pairs) in the "pre-filling loop"?

The idea is to fill the NEON queue. There is no reason to do exactly 4 iterations.
With fewer than 4 you could have 2 problems: not enough instructions in the NEON queue, and a possible interaction between the NEON memory writes and the ARM memory reads.
With more than 4, the algorithm needs a bigger n (number of iterations) to be efficient.

Quote (Mircea @ 07 February 2012 - 06:19 PM)

Why are there exactly TWO 128-bit comparisons in one iteration? Is this because the A9 is dual-issue?

No! This is because the plain vst1 form used here writes a whole D register, so the smallest write is 64 bits (2 * 32). A single 32-bit lane could be stored with the lane form of vst1, but that would cost one store per result.
So I'm doing 2 comparisons per iteration to get two 32-bit results to write.
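How one 128-bit compare mask becomes a single 32-bit result word can be modelled in portable C. The sketch below, with hypothetical function names, compares two ways of collapsing the four 32-bit lane masks: narrowing every lane (`vmovn.i32` followed by `vmovn.i16`) and shift-narrowing 64-bit lanes by 8 twice (`vshrn.i64 #8`). In this model the second variant never looks at the top lane, so a difference confined to bits 96..127 would go unnoticed.

```c
#include <stdint.h>

/* vmovn.i32 then vmovn.i16: keep the low byte of each 32-bit lane mask,
   so every lane contributes one byte to the result word. */
uint32_t collapse_vmovn(const uint32_t m[4])
{
    uint32_t word = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t n16 = (uint16_t)m[lane];   /* vmovn.i32 */
        uint8_t n8 = (uint8_t)n16;          /* vmovn.i16 */
        word |= (uint32_t)n8 << (8 * lane);
    }
    return word;
}

/* vshrn.i64 #8 on one 64-bit lane: shift right by 8, keep the low 32 bits. */
static uint32_t shrn64_by_8(uint64_t lane)
{
    return (uint32_t)(lane >> 8);
}

/* Two rounds of vshrn.i64 #8: 128 -> 64 -> 32 bits (lane 0 is m[0]). */
uint32_t collapse_vshrn8(const uint32_t m[4])
{
    uint64_t lo = ((uint64_t)m[1] << 32) | m[0];
    uint64_t hi = ((uint64_t)m[3] << 32) | m[2];
    uint64_t d = ((uint64_t)shrn64_by_8(hi) << 32) | shrn64_by_8(lo);
    return shrn64_by_8(d);
}
```

Either way, the ARM side only has to test the resulting word against all ones.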

Quote (Mircea @ 07 February 2012 - 06:19 PM)

How does the memory disambiguation mechanism impact the performance in your solution? Doesn't it stall the ARM pipeline?

Using a memory buffer to transfer data from NEON to the ARM core lets the 2 units work in parallel without any dependency problem, as long as they do not work on the same data.
That's all. It just avoids using the data transfer path between the units.
I ran some tests about that on the Cortex-A8 a few months ago: http://pulsar.websha...n-arm-and-neon/

Quote (Mircea @ 07 February 2012 - 06:19 PM)

Why is the transfer from VFP registers to ARM registers (more precisely the VMRS instruction) so frowned upon?
The "Cortex A9 MPE TRM" (section 3.4.10) states that a transfer from a VFP register to an "integer core" register has a latency of only 3 cycles.
How does the pipeline impact of VMRS compare to the impact of the memory disambiguation mechanism?

The main problem is the dependency between:
- the test
- the VMRS
- the conditional instruction.
We are talking about a pipelined processor, and these three steps are fully dependent on each other.
The cycle counts in the TRM are given for fully pipelined instructions.
So in real life you will not be able to get the move from the VFP to the ARM unit done in 3 cycles.

The best you can do is run some benchmarks :)
After all, it's not impossible that the VFP version is the fastest one!