Division with NEON: Sample code of VRECPE
Posted 30 September 2011 - 01:08 PM
I have 4 unsigned 16-bit values in a Dn register (or 8 in a Qn register)
[v1] [v2] [v3] [v4]
I'm looking for the code to finally get
[65536 / v1] [65536 / v2] [65536 / v3] [65536 / v4]
in another (or the same) Dn (or Qn) register...
Posted 30 September 2011 - 01:13 PM
Posted 30 September 2011 - 01:35 PM
That's exactly what I'm looking for.
I'd like to know how to use
I don't understand what the estimate of 1 / 1234 is when I'm using the U32 data type!
I've found this code:
    vrecpe.f32 d1, d5
    vrecps.f32 d2, d1, d5
    vmul.f32 d1, d1, d2
    vrecps.f32 d2, d1, d5
    vmul.f32 d5, d1, d2
and it works correctly with float operations.
I'm looking for the same code using unsigned integers!
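To see why the float sequence above converges, here is a minimal Python sketch of the same arithmetic. It assumes vrecps.f32 computes 2 - a*b, which is what a Newton-Raphson reciprocal step needs. The function names and the deliberately rough starting estimate are illustrative only, and Python floats are doubles rather than single precision.

```python
def vrecps_model(a, b):
    """Assumed behaviour of the step instruction: returns 2 - a*b."""
    return 2.0 - a * b


def refine_reciprocal(x, estimate, iterations=2):
    """Refine an estimate of 1/x the way each vrecps/vmul pair does."""
    y = estimate
    for _ in range(iterations):
        y = y * vrecps_model(y, x)  # y <- y * (2 - x*y)
    return y


if __name__ == "__main__":
    x = 1234.0
    rough = (1.0 / x) * (1.0 + 2.0 ** -8)  # stand-in for an ~8-bit estimate
    for n in range(3):
        y = refine_reciprocal(x, rough, n)
        print(n, abs(x * y - 1.0))  # relative error after n iterations
```

Each iteration roughly squares the relative error, which is why two steps are enough to take an estimate of about 8 bits to single-precision accuracy.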
Posted 03 October 2011 - 09:14 PM
Another way to look at it is that vrecpe.u32 works on values between 0.5 and 1.0 (1.0 not included). The input is read as an unsigned fixed-point number with no whole bits and 32 fraction bits, so due to the input constraint the top bit must always be 1; an input with the top bit clear just returns 0xFFFFFFFF. The result is in 1.31 format: no sign bit, 1 whole bit, and 31 fraction bits.
The reason for this format is to limit the possible range of the calculated reciprocal, which you'll notice must be between 1.0 and 2.0. The one whole-number bit in the result is there to cover this range. Without this range limiting you couldn't define a very useful data representation for integer reciprocals, since the reciprocal of any whole number greater than 1 is a fraction.
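The fixed-point formats can be made concrete with a small Python model of one vrecpe.u32 lane. This is a sketch based on my reading of the estimate pseudocode in the ARM architecture manual (top 9 bits in, a multiple of 1/256 out); vrecpe_u32_model and recip_estimate_256ths are made-up names, and real hardware should be checked before relying on exact values.

```python
def recip_estimate_256ths(q):
    """Reciprocal estimate in units of 1/256 for an input of about q/512.

    q is the top 9 bits of the operand (256..511). This is the integer
    form of floor(256 * (512 / (q + 0.5)) + 0.5).
    """
    d = 2 * q + 1
    return (2 * 262144 + d) // (2 * d)


def vrecpe_u32_model(operand):
    """Hypothetical model of one vrecpe.u32 lane (not a real API)."""
    operand &= 0xFFFFFFFF
    if operand < 0x80000000:
        return 0xFFFFFFFF  # top bit clear: input is out of range
    s = recip_estimate_256ths(operand >> 23)  # 256..511
    return s << 23  # 1.31 fixed point, value s/256 in [1.0, 2.0)


if __name__ == "__main__":
    for value in (0x80000000, 0x9A400000, 0xFFFFFFFF, 1234):
        print(hex(value), hex(vrecpe_u32_model(value)))
```

Under this model an unnormalized input such as 1234 saturates to 0xFFFFFFFF, which is why the operand has to be shifted into range first.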
Normalization converts an input x to the form:
x_normalized = x * 2^shift
x = x_normalized * 2^-shift
Where the multiplication can be performed by a bit-shift. Note that for the reciprocal:
x_reciprocal = 1 / x = 1 / (x_normalized * 2^-shift) = (1 / x_normalized) * 2^shift
This means you end up performing a left shift at the end to undo the normalization. It is a left shift rather than a right shift because taking the reciprocal flips the sign of the exponent.
Then for the actual division:
a = y / x
a = y * (1 / x)
a = y * ((1 / x_normalized) * 2^shift)
a = (y * (1 / x_normalized)) * 2^shift
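The algebra above can be checked with exact rational arithmetic. The Python sketch below is only an illustration: divide_via_normalized_reciprocal is a made-up name, and shift is the exponent applied to the real value of x, so it is negative for x >= 1 (in raw fixed-point bits the same normalization shows up as a left shift).

```python
from fractions import Fraction


def divide_via_normalized_reciprocal(y, x, shift):
    """Compute y / x as (y * (1 / x_normalized)) * 2**shift, exactly."""
    scale = Fraction(2) ** shift
    x_normalized = x * scale  # x_normalized = x * 2^shift
    if not Fraction(1, 2) <= x_normalized < 1:
        raise ValueError("shift does not normalize x into [0.5, 1.0)")
    recip_normalized = 1 / x_normalized  # lies in (1.0, 2.0]
    return y * recip_normalized * scale  # undo the normalization


if __name__ == "__main__":
    # 1234 = 0.6025... * 2^11, so the normalizing exponent is -11.
    print(divide_via_normalized_reciprocal(65536, 1234, -11))
```

The result equals 65536 / 1234 exactly, confirming that the normalization factor is applied once more after the reciprocal rather than removed with the opposite sign.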
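You can find the normalization shift with a count leading zeroes instruction. In your case you'll want to use vclz.i16. Note that vrecpe.u32 needs the top bit of its 32-bit input set, so for a 16-bit value x held in a 32-bit lane the shift is clz16(x) + 16, which moves the highest set bit of x up to bit 31.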
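Here is a small Python sketch of finding the shift with a leading-zero count. It assumes the normalized operand must end up with its top bit set in a 32-bit lane (my reading of how vrecpe.u32 treats its input); clz32 and normalize_u32 are illustrative names, not NEON intrinsics.

```python
def clz32(x):
    """Count leading zeroes of a 32-bit value, like a per-lane vclz."""
    return 32 - (x & 0xFFFFFFFF).bit_length()


def normalize_u32(x):
    """Shift x left so that its highest set bit lands in bit 31.

    Returns (normalized value, shift). x must be non-zero, since zero
    has no reciprocal and cannot be normalized.
    """
    shift = clz32(x)
    return (x << shift) & 0xFFFFFFFF, shift


if __name__ == "__main__":
    normalized, shift = normalize_u32(1234)
    print(hex(normalized), shift)
```

For a 16-bit input that was zero-extended to 32 bits, the shift returned here is the 16-bit leading-zero count plus 16.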
However, you will not always get the correct answer using vrecpe.u32, because it's only correct to ~8 bits. To get correct 16-bit values you need to refine the result with Newton-Raphson iteration. That is, for an estimate y of 1 / x,
y_refined = y * (2 - (x * y))
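In fixed point the refinement step might look like the Python sketch below. It assumes x is already normalized with 32 fraction bits (value in [0.5, 1.0)) and y is an estimate with 1 whole bit and 31 fraction bits; newton_step_fixed is a made-up name, and the wide intermediate products stand in for NEON's long multiplies.

```python
def newton_step_fixed(x_norm, y):
    """One step of y <- y * (2 - x*y) on raw fixed-point integers.

    x_norm: 0.32 format, value in [0.5, 1.0)
    y:      1.31 format, value in [1.0, 2.0]
    """
    xy = (x_norm * y) >> 32        # 63 fraction bits -> keep 31
    correction = (1 << 32) - xy    # 2.0 in 1.31 format is 1 << 32
    return (y * correction) >> 31  # 62 fraction bits -> back to 31


if __name__ == "__main__":
    x_norm = 0x9A400000  # 1234 normalized (1234 << 21)
    y = 0xD4800000       # an ~8-bit estimate of 1 / 0.6025...
    for step in range(3):
        print(step, y / 2 ** 31)
        y = newton_step_fixed(x_norm, y)
```

Both multiplications need the full 64-bit product before the low bits are discarded, which is the awkward part on NEON that the next paragraph mentions.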
This is kind of a pain to do with integers on NEON because there's no integer equivalent of vrecps, and since this is a fixed-point multiplication you need the long result, only to throw out the bottom bits. Honestly, you're probably better off just converting to floating point and back. You don't even have to do the final multiplication: vcvt can convert between floating point and fixed point, which does the multiplication (left shift by 16) for free. Of course, you can do something similar if you stick with integers.
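The "free" multiplication can be illustrated with a Python model of the float-to-fixed conversion. This is a sketch under assumptions: vcvt_u32_f32_fixed_model is a made-up name, it assumes the conversion scales by 2^fraction_bits, truncates toward zero and saturates, and an exactly rounded single-precision reciprocal stands in for a fully refined one.

```python
import struct


def to_f32(value):
    """Round a Python float (a double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def vcvt_u32_f32_fixed_model(f, fraction_bits):
    """Hypothetical model of a float -> unsigned fixed-point conversion."""
    scaled = int(f * (1 << fraction_bits))  # truncates toward zero
    return max(0, min(scaled, 0xFFFFFFFF))  # saturate to 32 bits


def recip_times_65536(x):
    """65536 / x for non-zero x, using only a reciprocal and a conversion."""
    r = to_f32(1.0 / to_f32(float(x)))
    return vcvt_u32_f32_fixed_model(r, 16)


if __name__ == "__main__":
    for x in (1, 2, 3, 1234, 65535):
        print(x, recip_times_65536(x), 65536 // x)
```

Asking for 16 fraction bits in the conversion is the same as multiplying the reciprocal by 65536 before truncating, so no separate vmul is needed.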
Posted 04 October 2011 - 07:02 AM
I've checked the precision of the divide approximation. You're right, it is close to 8 bits!
That's enough for me. So finally, the code I used is this one:
    vcvt.f32.u32 q0, q0
    vrecpe.f32 q0, q0
    vmul.f32 q0, q0, q1    @ q1 = 65536.0 in each lane
    vcvt.u32.f32 q0, q0
Precision is enough for my colour processing!
Speed is quite good!
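To get a feel for what "close to 8 bits" means for this estimate-only sequence, here is a rough Python model. It is a sketch based on my reading of the reciprocal estimate pseudocode in the ARM architecture manual and only covers positive normal inputs; vrecpe_f32_model and estimate_65536_over are made-up names, and real hardware should be checked before relying on exact values.

```python
import math


def recip_estimate(a):
    """Estimate 1/a for 0.5 <= a < 1.0, as a multiple of 1/256."""
    q = int(a * 512.0)                 # a in units of 1/512, rounded down
    r = 1.0 / ((q + 0.5) / 512.0)      # reciprocal of the interval midpoint
    s = int(256.0 * r + 0.5)           # round to the nearest 1/256
    return s / 256.0


def vrecpe_f32_model(x):
    """Hypothetical model of one vrecpe.f32 lane for positive normal x."""
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= m < 1.0
    return math.ldexp(recip_estimate(m), -e)


def estimate_65536_over(x):
    """Model of convert, estimate, multiply by 65536, convert back."""
    r = vrecpe_f32_model(float(x))
    return int(r * 65536.0)            # the final conversion truncates


if __name__ == "__main__":
    for x in (1, 3, 1234):
        print(x, estimate_65536_over(x), 65536 // x)
```

Under this model 65536 / 3 comes out as 21824 instead of 21845, an error of about one part in a thousand, which is fine for colour work but not for exact division.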
Posted 05 October 2011 - 07:41 AM
    vcvt.f32.u32 q0, q0
    vrecpe.f32 q0, q0
    vcvt.u32.f32 q0, q0, #16
I'll check this evening, but it should work!
Thanks for this useful optimisation!
Posted 26 September 2012 - 02:30 PM
I ended up on this discussion while looking for a way to do integer division in ARM NEON registers.
My problem is similar; the only difference is that I need to work with signed 16-bit integers instead of unsigned ones.
Is there any way to do this? Also, I don't understand how the original uint16 problem of this thread was solved by converting uint32 to float32. How is this possible when the Qn register holds a uint16x8_t?
Posted 26 September 2012 - 04:24 PM
// 8x16-bit signed inputs are in q0

// Elements in q1 are 0xFFFF for negative values, 0x0000 for positive (or zero) values
vclt.s16 q1, q0, #0

// Make negative values positive
vabs.s16 q0, q0

// ... Division performed here, results in q0 ...

// Negate values that were negative. This is done by observing that neg(x) = not(x) + 1.
// For values that were negative the field in q1 was 0xFFFF, therefore we get ((x ^ 0xFFFF) - 0xFFFF) which is not(x) + 1.
// For values that were positive the field in q1 was 0x0000, therefore we get (x ^ 0x0000) - 0x0000 which is just x.
// If you can, put some other operation between these two instructions to avoid a stall.
veor.s16 q0, q0, q1
vsub.s16 q0, q0, q1
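The sign handling above can be modelled on a single 16-bit lane in Python. This is only a sketch: divide_signed_trunc and as_signed16 are made-up names, the numerator is assumed non-negative, and Python's integer division stands in for the unsigned division step.

```python
def divide_signed_trunc(numerator, x):
    """Divide a non-negative numerator by a non-zero signed 16-bit x."""
    x16 = x & 0xFFFF
    mask = 0xFFFF if x16 & 0x8000 else 0x0000  # like vclt.s16 ..., #0
    magnitude = (0x10000 - x16) & 0xFFFF if mask else x16  # like vabs.s16
    quotient = (numerator // magnitude) & 0xFFFF  # unsigned division step
    return ((quotient ^ mask) - mask) & 0xFFFF  # like veor then vsub


def as_signed16(value):
    """Reinterpret a 16-bit lane as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value


if __name__ == "__main__":
    print(as_signed16(divide_signed_trunc(100, 7)))
    print(as_signed16(divide_signed_trunc(100, -7)))
    print(100 // -7)  # Python rounds toward negative infinity instead
```

With a divisor of -7 the model gives -14 where Python's floor division gives -15, which is the difference between rounding toward zero and rounding toward negative infinity.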
Note that this will round the negative result towards zero if the positive result was rounded towards zero. This is how most CPU integer divide instructions work, but if you want one that rounds negative values towards negative infinity you'll have to do this differently.
For the actual division please refer to my post from October 3. Note that webshaker didn't need fully accurate results, so he was able to just use the reciprocal approximation instruction by itself. But if you want accurate results that won't work. He also must have been starting with 4x32-bit unsigned values, so he probably did the conversion from 16-bit to 32-bit earlier in code he didn't show (with a vmovl or something).