ARM Community: A8/9 NEON 128bit registers, 64bit alu's

#1 dzn

Posted 01 December 2011 - 05:12 PM

Hi,

I've been reading a lot about the NEON architecture for a project, but there is still one thing I'm not entirely sure about. If I understand correctly, the 128-bit view is more of an aid for the programmer: since the ALUs in the NEON engine are only 64 bits wide, instructions working on a Qx register will just take double the time. I was going over the timing tables, trying my best to understand them :-), and saw this confirmed in some instruction timings, but in others the operation on 128 bits also takes just one cycle (add, for example).

Did I read the table wrong? If not, how is this achieved? Is the work divided over two 64-bit adders which are then coupled (carry)?

Thanks in advance

#2 Exophase

Posted 01 December 2011 - 06:01 PM

This is a basic rundown of what Cortex-A8/Cortex-A9's NEON implementation provides (note that this is slightly speculative, but pretty well supported by existing documentation):

- 2 64-bit simple integer ALUs, capable of add/sub/logic/shifts/compares/min/max/etc. Only one of them can do certain operations such as bit selects, variable shifts, and horizontal operations. And of course anything widening or narrowing isn't 128-bit to 128-bit. Note that the ALUs can do some full 64-bit operations like add/sub/shift.
- 1 64/128-bit permute unit. Some 128-bit to 64-bit operations like vmovn take one cycle, as do some 128-bit operations like reverse and swap, but for the most part it's 1 cycle for 64-bit operations, as with zip/unzip and ext. tbl takes at least 2 cycles and is 64-bit only.
- 8 8x16 integer multipliers with accumulate. These can be chained to do 8 8x8 MACs, 4 16x16 MACs, or 1 32x32 MAC in a cycle (note that the last one requires 2 32x32 MACs in 2 cycles because of the register arrangement).
- 1 128-bit load/store unit.
- 2 single-precision floating-point multipliers and 2 single-precision floating-point add/sub/cmp/etc. units.
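The multiplier figures in the list above follow from simple counting. The sketch below is only that arithmetic, not a description of the actual datapath: it assumes each multiplier block covers an 8-bit slice of one operand and a 16-bit slice of the other, and the helper names are made up for illustration.

```python
from math import ceil

BLOCKS = 8  # number of 8x16 multiplier blocks, taken from the list above


def blocks_needed(a_bits, b_bits):
    """8x16 blocks needed to tile one a_bits x b_bits product (assumed tiling)."""
    # Try both orientations of the 8x16 block and keep the cheaper one.
    return min(
        ceil(a_bits / 8) * ceil(b_bits / 16),
        ceil(a_bits / 16) * ceil(b_bits / 8),
    )


def macs_per_cycle(a_bits, b_bits):
    """How many such products fit into the 8 blocks at once."""
    return BLOCKS // blocks_needed(a_bits, b_bits)


for width in (8, 16, 32):
    print(f"{width}x{width}: {macs_per_cycle(width, width)} per cycle")
```

Under that assumption an 8x8 product uses one block, a 16x16 product two, and a 32x32 product all eight, which lines up with the 8 / 4 / 1 figures quoted above.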

Aside from what's mentioned in literature and the TRM's timings I've confirmed most of this experimentally.

So a majority of simple integer operations (not counting multiplies) can be performed in 1 cycle, as can loads/stores and some permutes. I think that ARM wants to maintain NEON performance at about double the throughput of the ARMv6 equivalent, where you have 2 32-bit ALUs (with some SIMD operations), 4 8x16 multipliers (although you can't do fully independent 16x16 MACs or anything 8x8 or 8x16) and 1 single/double-precision FPU. On Cortex-A5, NEON only has one 64-bit ALU, which corresponds with the integer core only having one 32-bit ALU.
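On the carry question from the first post: NEON integer adds are lane-wise, and the widest lane is 64 bits, so no carry ever has to cross from one 64-bit half of a Q register to the other. The sketch below is a plain software model of that property (the function names are invented here, and it says nothing about how the hardware is actually wired): adding the two halves independently gives the same result as one 128-bit lane-wise add.

```python
def lane_add(a, b, lane_bits):
    """Lane-wise wrapping add of two equal-length lists of unsigned lanes."""
    mask = (1 << lane_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


def q_add_as_two_d_adds(a, b, lane_bits):
    """Model a 128-bit add as two independent 64-bit adds, with no carry link."""
    lanes_per_d = 64 // lane_bits  # lanes that fit in one 64-bit half
    lo = lane_add(a[:lanes_per_d], b[:lanes_per_d], lane_bits)
    hi = lane_add(a[lanes_per_d:], b[lanes_per_d:], lane_bits)
    return lo + hi


# Sixteen 8-bit lanes; lanes 0, 7 and 15 overflow and wrap within their own lane.
a = [250, 1, 2, 3, 4, 5, 6, 255, 8, 9, 10, 11, 12, 13, 14, 200]
b = [10] * 16
assert q_add_as_two_d_adds(a, b, 8) == lane_add(a, b, 8)
```

The same holds for 16-, 32- and 64-bit lanes, since a lane never straddles the boundary between the two halves.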

#3 shervin

Posted 10 December 2011 - 03:11 AM

It's true that NEON has 128-bit registers but nearly always operates on just 64 bits at a time. But the size of the registers is an Architecture (language) specification that can't change, whereas the internal use of 64 bits is an implementation issue that will change over time, so you can expect that perhaps in 1 more year, ARM devices will operate on 128 bits at a time instead of 64 bits. So if you write your NEON code for 128-bit now, it will be more future-proof, because the same code will potentially double in speed in the future!

Cheers,
Shervin Emami.
http://www.shervinem...rmAssembly.html

#4 Exophase

Posted 12 December 2011 - 05:11 PM

shervin, on 10 December 2011 - 03:11 AM, said:

It's true that NEON has 128-bit registers but nearly always operates on just 64 bits at a time.


This is NOT true, please read my post. For integer operations, 128-bit operation is more the rule than the exception, at least on Cortex-A8 and A9. If you're targeting these devices it's important to know which operations do and don't operate on 128 bits in one cycle.

#5 shervin

Posted 12 December 2011 - 10:20 PM

Yes, you are right: many operations use all 128 bits at once on Cortex-A8/A9, so your detailed analysis is quite good. I was just giving the rough overview that even if the more complex 128-bit operations take 2 (or occasionally 3) cycles instead of 1, it is still a good idea to write NEON code for 128 bits, because future ARMv7 devices will soon have even more 128-bit single-cycle NEON paths.

-Shervin.

#6 Exophase

Posted 12 December 2011 - 10:28 PM

I partially agree with this. Using 128-bit operations instead of 2x 64-bit, even where the CPU takes 2 cycles, can also save fetch/decode time, although that isn't usually a bottleneck with NEON on A8/A9. But depending on your code, moving to a 128-bit granularity could end up costing cycles if you're not always using all the elements in the vector. In these cases you may be better off sticking with the 64-bit forms.

Of course, if future proofing really is a big goal then that probably trumps this.
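To make the trade-off concrete, here is a toy cost model. Everything in it is an assumption for illustration: the `cycles` helper is hypothetical, the per-instruction cycle counts are placeholders rather than TRM figures, and it ignores issue limits, latency, and any packing or unpacking instructions.

```python
from math import ceil


def cycles(n_elems, lane_bits, reg_bits, cycles_per_instr):
    """Cycles to process n_elems lanes with one instruction form (toy model)."""
    lanes = reg_bits // lane_bits  # lanes per register
    return ceil(n_elems / lanes) * cycles_per_instr


# Hypothetical operation that costs 1 cycle in its 64-bit form
# and 2 cycles in its 128-bit form.
print(cycles(8, 8, 64, 1))    # 8 useful bytes, D form
print(cycles(8, 8, 128, 2))   # 8 useful bytes, Q form (half the lanes wasted)
print(cycles(16, 8, 64, 1))   # 16 useful bytes, two D-form instructions
print(cycles(16, 8, 128, 2))  # 16 useful bytes, one Q-form instruction
print(cycles(16, 8, 128, 1))  # same Q-form code on a hypothetical wider core
```

With these placeholder numbers, the 128-bit form only loses when half the vector is unused, ties with two 64-bit instructions when the vector is full, and wins outright if a later core executes it in a single cycle.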
