ARM Community: Need help in GCC intrinsics for NEON - ARM Community


#1 KiranKumar

Posted 04 April 2012 - 06:19 AM

Hi All,


Can somebody tell me the equivalent GCC and ARM intrinsics for generating the NEON assembly statements below?

vld3.16 {d0,d2,d4},[r0]!
vld3.16 {d1,d3,d5},[r0]!

Thanks,
Kiran

#2 archie

Posted 09 April 2012 - 06:03 AM


For RVCT 5.01, I see:
vld3.16 {d0,d2,d4},[r0] is represented by vld3q_u16(__transfersize(24) uint16_t const * ptr);
but check your documentation to be sure.

#3 isogen74

Posted 09 April 2012 - 09:04 AM

GCC is the same as RVCT:

uint16x8x3_t vld3q_u16 (const uint16_t *)
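
As a rough illustration of what that prototype gives you, the sketch below loads 8 interleaved triples (24 values) and splits them into three channels. The helper name `split3_u16` is hypothetical, and the scalar branch is only there so the example also builds on a non-NEON host. Whether the compiler emits exactly the pair of vld3.16 instructions from the question is not guaranteed, so it is worth checking the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: src holds 8 interleaved triples (24 values).
 * c0, c1 and c2 each receive the 8 values of one channel. */
void split3_u16(const uint16_t *src, uint16_t *c0, uint16_t *c1, uint16_t *c2)
{
    assert(src != NULL && c0 != NULL && c1 != NULL && c2 != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One intrinsic call: val[0], val[1], val[2] are 128-bit vectors. */
    uint16x8x3_t v = vld3q_u16(src);
    vst1q_u16(c0, v.val[0]);
    vst1q_u16(c1, v.val[1]);
    vst1q_u16(c2, v.val[2]);
#else
    /* Scalar model of the same de-interleaving load. */
    for (size_t i = 0; i < 8; ++i) {
        c0[i] = src[3 * i];
        c1[i] = src[3 * i + 1];
        c2[i] = src[3 * i + 2];
    }
#endif
}
```

With an input of 0, 1, 2, ... 23 the first channel receives 0, 3, 6, ... 21, which corresponds to the contents of d0 and d1 after the two loads in the original question.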

Check http://gcc.gnu.org/o...Intrinsics.html for the full listing (loads and stores are towards the bottom).
One bit of advice: use objdump to check the disassembly GCC emits for NEON intrinsics. Personally I've never been entirely happy with it. It generates an excessive amount of stack traffic to shuffle things between registers, and the intrinsics are so low level that you may as well handle register allocation yourself, write the assembler, and get the output code you actually wanted in the first place.

To be fair it is improving a lot in the newer GCC releases, but my personal view is that if you have to spell out instructions using intrinsics one instruction at a time you are basically writing assembler anyway ;)

Iso


When optimizing software, consider that the quickest code to run is the bit you removed from the call path.

#4 KiranKumar

Posted 09 April 2012 - 11:42 AM

Hi, thanks for the reply.

My actual question should have been different.

vld3.16 {d0,d2,d4},[r0]!
vld3.16 {d1,d3,d5},[r0]!
vadd.16 q3,q0,q1

After loading the data into d0 and d1, I want to use them together as a single Q register (q0).
I can do that by writing the assembly, but I want to know whether I can do the same thing using intrinsics, and how.
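
One way this might be expressed, as a sketch only: the `val[]` members returned by `vld3q_u16` are already 128-bit vectors, so they can be passed straight to `vaddq_u16`, leaving the choice of d and q registers to the compiler. The function name `add_ch0_ch1` is hypothetical, and the scalar branch is an assumption added so the example builds on a non-NEON host. The register numbers in the asm above are not something the intrinsics let you pin down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: src holds 8 interleaved triples (24 values).
 * out[i] = channel0[i] + channel1[i], wrapping modulo 2^16 like vadd.16. */
void add_ch0_ch1(const uint16_t *src, uint16_t *out)
{
    assert(src != NULL && out != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8x3_t v = vld3q_u16(src);               /* de-interleaving load      */
    uint16x8_t sum = vaddq_u16(v.val[0], v.val[1]); /* 128-bit add of two channels */
    vst1q_u16(out, sum);
#else
    /* Scalar model of the same load and add. */
    for (size_t i = 0; i < 8; ++i) {
        out[i] = (uint16_t)(src[3 * i] + src[3 * i + 1]);
    }
#endif
}
```

Compiling this for an ARM target and inspecting the result with objdump should show whether the generated code matches the hand-written sequence.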

I have also run into the problem you mentioned with the GCC tools.
However, when there is not much ARM code between the NEON code or NEON intrinsic statements, GCC does better and
generates tighter NEON assembly without moving data between registers and the stack.
What I observed is that register allocation for the NEON variables used in intrinsics is not handled as well as it is for ARM code.

Please let me know whether what I am observing is correct.


BRs,
Kiran Kumar





#5 isogen74

Posted 10 April 2012 - 10:53 PM

Half the point of intrinsics is that they hide register allocation, which is a good thing for the compiler to handle and is closely tied to instruction scheduling, the other reason to use them rather than asm.
I can't see a way of directly doing what you want using intrinsics. You generally have to either cast pointers to the intrinsic structure types and type-pun (which is probably bad on newer compilers with strict aliasing), or memcpy fields between them to get things in the order you want. Unfortunately compilers don't really like the aliasing of D registers onto Q registers, so this tends to be one area where code generation suffers a bit in my experience.
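
A possible alternative to pointer casts or memcpy, offered as a sketch rather than a confirmed fix: `arm_neon.h` also provides `vcombine_u16`, which builds a 128-bit value from two 64-bit halves (and `vget_low_u16` / `vget_high_u16` for the reverse direction). The function name `add_ch0_ch1_halves` is hypothetical, the scalar branch is an assumption so the example builds on a non-NEON host, and whether the compiler maps the halves onto adjacent D registers without extra moves is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: two 64-bit de-interleaving loads of 4 triples each,
 * then the halves of channels 0 and 1 are joined into 128-bit values and added. */
void add_ch0_ch1_halves(const uint16_t *src, uint16_t *out)
{
    assert(src != NULL && out != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x4x3_t lo = vld3_u16(src);       /* triples 0..3 */
    uint16x4x3_t hi = vld3_u16(src + 12);  /* triples 4..7 */
    uint16x8_t ch0 = vcombine_u16(lo.val[0], hi.val[0]);
    uint16x8_t ch1 = vcombine_u16(lo.val[1], hi.val[1]);
    vst1q_u16(out, vaddq_u16(ch0, ch1));
#else
    /* Scalar model: process the low half, then the high half. */
    for (size_t half = 0; half < 2; ++half) {
        const uint16_t *p = src + 12 * half;
        for (size_t i = 0; i < 4; ++i) {
            out[4 * half + i] = (uint16_t)(p[3 * i] + p[3 * i + 1]);
        }
    }
#endif
}
```

The result should match a single `vld3q_u16` followed by `vaddq_u16`; which form generates better code is something to confirm in the disassembly for your compiler version.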