ARM Community: ARM NEON equivalent of Intel SSE - ARM Community


#1 zibba

Posted 15 May 2012 - 02:35 AM

Hi

I'm trying to convert some Intel SSE code to ARM NEON and I can't find NEON equivalents of the SSE scalar single-precision intrinsics _mm_mul_ss and _mm_add_ss. Please help.

thanks

#2 isogen74

Posted 15 May 2012 - 07:09 AM

I don't believe NEON has any direct equivalent - you'll either have to:

* multiply out everything and merge two result registers to get what you want
* do the scalar multiply on the ARM core using normal FPU instructions, and then merge that in. You could use single-lane load store rather than register mangling in this case, which _may_ be faster.
When optimizing software, consider that the quickest code to run is the bit you removed from the call path.

#3 zibba

Posted 15 May 2012 - 08:08 AM

isogen74, on 15 May 2012 - 07:09 AM, said:

I don't believe NEON has any direct equivalent - you'll either have to:

* multiply out everything and merge two result registers to get what you want
* do the scalar multiply on the ARM core using normal FPU instructions, and then merge that in. You could use single-lane load store rather than register mangling in this case, which _may_ be faster.


Thanks for the quick reply. At least that explains why I couldn't find those instructions. I think I'll use the first version to avoid going to and from the FPU.
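
For anyone landing on this thread later, here is a rough sketch of the two options suggested above. It is not from the original posters. The function names (mul_ss_merge, add_ss_merge, mul_ss_scalar) and the array-based interface are made up for illustration. It assumes the goal is the usual _mm_mul_ss / _mm_add_ss behaviour: lane 0 holds the scalar result and lanes 1..3 are copied from the first operand. The NEON path uses vmulq_f32 / vaddq_f32 with vbslq_f32 for the merge, and vld1q_lane_f32 for the single-lane load. A plain C fallback is included so the sketch also builds on non-ARM hosts.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0   /* portable fallback so the sketch builds anywhere */
#endif

/* Option 1: multiply all four lanes, then keep lane 0 of the product
 * and lanes 1..3 of a (the assumed _mm_mul_ss behaviour). */
void mul_ss_merge(const float a[4], const float b[4], float out[4])
{
#if USE_NEON
    static const uint32_t lane0_only[4] = {0xFFFFFFFFu, 0u, 0u, 0u};
    float32x4_t va   = vld1q_f32(a);
    float32x4_t vb   = vld1q_f32(b);
    float32x4_t prod = vmulq_f32(va, vb);       /* full-width multiply */
    uint32x4_t  mask = vld1q_u32(lane0_only);
    /* Bitwise select: take prod where mask bits are set, else va. */
    vst1q_f32(out, vbslq_f32(mask, prod, va));
#else
    float r0 = a[0] * b[0];
    out[1] = a[1]; out[2] = a[2]; out[3] = a[3];
    out[0] = r0;
#endif
}

/* Same idea for the assumed _mm_add_ss behaviour. */
void add_ss_merge(const float a[4], const float b[4], float out[4])
{
#if USE_NEON
    static const uint32_t lane0_only[4] = {0xFFFFFFFFu, 0u, 0u, 0u};
    float32x4_t va   = vld1q_f32(a);
    float32x4_t vb   = vld1q_f32(b);
    float32x4_t sum  = vaddq_f32(va, vb);       /* full-width add */
    uint32x4_t  mask = vld1q_u32(lane0_only);
    vst1q_f32(out, vbslq_f32(mask, sum, va));
#else
    float r0 = a[0] + b[0];
    out[1] = a[1]; out[2] = a[2]; out[3] = a[3];
    out[0] = r0;
#endif
}

/* Option 2: do the multiply as an ordinary scalar float operation,
 * then merge the result into lane 0 with a single-lane load. */
void mul_ss_scalar(const float a[4], const float b[4], float out[4])
{
    float s = a[0] * b[0];                      /* scalar FPU multiply */
#if USE_NEON
    float32x4_t va = vld1q_f32(a);
    vst1q_f32(out, vld1q_lane_f32(&s, va, 0));  /* load s into lane 0 only */
#else
    out[1] = a[1]; out[2] = a[2]; out[3] = a[3];
    out[0] = s;
#endif
}
```

With a = {2, 3, 4, 5} and b = {10, 20, 30, 40}, mul_ss_merge and mul_ss_scalar should both give {20, 3, 4, 5}, and add_ss_merge should give {12, 3, 4, 5}. In real code you would probably keep the values in float32x4_t registers instead of going through arrays. Which option is faster depends on the core, so it is worth measuring both.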