As pointed out in your other post on this topic:
(1) You can't do this directly in NEON. For short strides you can use VLD2/VLD3/VLD4 and discard the elements you don't want, but that isn't very efficient.
(2) It is generally inefficient, so you are better off restructuring your data, or unpacking the loop, so that you don't need to.
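To make the two points concrete, here is a rough sketch in C with NEON intrinsics. It assumes a stride of 2 over float data, which is only an example since the original question isn't quoted here, and the function names sum_stride2 and sum_contiguous are made up for illustration. The first function uses VLD2 and throws away half of what it loads; the second assumes the data has been restructured so the wanted values are contiguous and a plain VLD1 is enough. Both fall back to scalar code when NEON isn't available.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sums src[0], src[2], ..., src[2*(n-1)].
   src must hold at least 2*n floats. */
float sum_stride2(const float *src, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        /* VLD2 loads 8 consecutive floats and de-interleaves them:
           val[0] gets the even-indexed ones, val[1] the odd-indexed ones. */
        float32x4x2_t pair = vld2q_f32(src + 2 * i);
        /* val[1] was loaded but is discarded, so half the bandwidth is wasted. */
        acc = vaddq_f32(acc, pair.val[0]);
    }
    float lanes[4];
    vst1q_f32(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i)
        total += src[2 * i];
    return total;
}

/* Hypothetical helper: same sum, assuming the data was restructured so that
   the n wanted values are stored contiguously. */
float sum_contiguous(const float *src, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        /* VLD1 loads 4 floats and every one of them is used. */
        acc = vaddq_f32(acc, vld1q_f32(src + i));
    }
    float lanes[4];
    vst1q_f32(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; ++i)
        total += src[i];
    return total;
}
```

The VLD2 version touches twice as much memory for the same result, and the waste grows with VLD3/VLD4. Beyond a stride of 4 there is no structure load that helps, which is why changing the data layout is usually the better fix.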
Iso
When optimizing software, consider that the quickest code to run is the bit you removed from the call path.