Need help in GCC intrinsics for NEON
#1
Posted 04 April 2012 - 06:19 AM
Can somebody tell me which GCC and ARM intrinsics generate the NEON assembly statements below?
vld3.16 {d0,d2,d4},[r0]!
vld3.16 {d1,d3,d5},[r0]!
Thanks,
Kiran
#2
Posted 09 April 2012 - 06:03 AM
For RVCT 5.01, I see:
vld3.16 {d0,d2,d4},[r0] is represented by vld3q_u16(__transfersize(24) uint16_t const * ptr);
but check your documentation to be sure.
#3
Posted 09 April 2012 - 09:04 AM
uint16x8x3_t vld3q_u16 (const uint16_t *)
Check http://gcc.gnu.org/o...Intrinsics.html for the full listing (loads and stores are towards the bottom).
One bit of advice - use objdump to check the disassembly GCC emits for NEON intrinsics. Personally I've never been entirely happy with it - it generates an excessive amount of stack traffic to shuffle things between registers - and the intrinsics are so low level you may as well handle register allocation yourself, write the assembler and get the output code you actually wanted in the first place.
To be fair, it is improving a lot in the newer GCC releases, but my personal view is that if you have to spell out instructions using intrinsics one instruction at a time, you are basically writing assembler anyway.
Iso
#4
Posted 09 April 2012 - 11:42 AM
My actual question should have been different.
vld3.16 {d0,d2,d4},[r0]!
vld3.16 {d1,d3,d5},[r0]!
vadd.16 q3,q0,q1
After loading the data into the d0 and d1 registers, I want to use them as one Q register.
I can do that by writing the assembly, but I want to know whether I can do the same thing using intrinsics, and how.
I have also experienced the problem you mentioned with the GCC tools.
But when there is not much ARM code between the NEON code or NEON intrinsic statements, GCC does better and generates tighter NEON code without data transfers between registers and the stack.
What I observed is that register allocation for the NEON variables used in intrinsics is not handled as well as it is for ARM code.
Please let me know whether what I am observing is correct.
BRs,
Kiran Kumar
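For reference, the load-then-add sequence in this post can be sketched with the Q-sized intrinsics mentioned earlier in the thread. This is only a minimal sketch: the helper name add_fields_q is hypothetical, the data is assumed to be 8 records of 3 interleaved uint16_t fields, and the registers the compiler actually picks (and whether it emits exactly the two vld3.16 instructions above) should be checked in the disassembly. A scalar fallback is included so the same file also builds on non-NEON targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: reads 24 interleaved uint16_t values (8 records of
 * 3 fields) and writes field0 + field1 of each record into dst[0..7]. */
void add_fields_q(const uint16_t *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);

    /* De-interleaving load: v.val[0], v.val[1], v.val[2] are each a
     * 128-bit (Q-sized) vector holding one field of all 8 records. */
    uint16x8x3_t v = vld3q_u16(src);

    /* Q-sized add, in the spirit of "vadd.16 q3,q0,q1"; the actual
     * register numbers are chosen by the compiler, not by this code. */
    uint16x8_t sum = vaddq_u16(v.val[0], v.val[1]);

    vst1q_u16(dst, sum);
}
#else
/* Scalar fallback with the same result, for targets without NEON. */
void add_fields_q(const uint16_t *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (int i = 0; i < 8; ++i)
        dst[i] = (uint16_t)(src[3 * i] + src[3 * i + 1]);
}
#endif
```

Calling add_fields_q(src, dst) with a 24-element src and an 8-element dst is enough to try it; running objdump on the result shows how close GCC gets to the hand-written assembly.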
#5
Posted 10 April 2012 - 10:53 PM
I can't see a way of doing what you want directly using intrinsics. You generally have to either cast the intrinsic structure type pointers and type-pun (which is probably bad on newer compilers with strict aliasing) or memcpy fields between them to get things in the order you want. Unfortunately, compilers don't really like the aliasing of D registers to Q registers, so in my experience this is one area where code generation tends to suffer a bit.
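A minimal sketch of the memcpy route described above, assuming the two halves come from separate D-sized vld3_u16 loads. The helper names join_halves_memcpy and add_fields_from_d_halves are hypothetical, and the memcpy version assumes a little-endian lane layout. For comparison, the second field is joined with vcombine_u16 from arm_neon.h, which is another way to build a Q-sized vector from two D-sized ones; whether either form avoids extra register moves depends on the compiler version, so check the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: joins two D-sized halves into one Q-sized vector by
 * copying bytes, which avoids pointer type-punning.
 * Assumption: little-endian lane layout in memory. */
static uint16x8_t join_halves_memcpy(uint16x4_t lo, uint16x4_t hi)
{
    uint16x4_t halves[2];
    uint16x8_t q;

    halves[0] = lo;
    halves[1] = hi;
    memcpy(&q, halves, sizeof q);   /* 2 x 8 bytes -> 16 bytes */
    return q;
}

/* Hypothetical helper: same result as a Q-sized load and add, but built
 * from two D-sized de-interleaving loads of 12 elements each. */
void add_fields_from_d_halves(const uint16_t *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);

    uint16x4x3_t a = vld3_u16(src);        /* records 0..3 */
    uint16x4x3_t b = vld3_u16(src + 12);   /* records 4..7 */

    uint16x8_t f0 = join_halves_memcpy(a.val[0], b.val[0]); /* memcpy route */
    uint16x8_t f1 = vcombine_u16(a.val[1], b.val[1]);       /* intrinsic route */

    vst1q_u16(dst, vaddq_u16(f0, f1));
}
#else
/* Scalar fallback with the same result, for targets without NEON. */
void add_fields_from_d_halves(const uint16_t *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (int i = 0; i < 8; ++i)
        dst[i] = (uint16_t)(src[3 * i] + src[3 * i + 1]);
}
#endif
```

The memcpy is there purely to stay within strict-aliasing rules; a reasonable optimizer is expected to remove the copy, but that is not guaranteed.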















