ARM Community: Cortex-R4 : does "dual-issued pairs" really improve performance ? - ARM Community

Cortex-R4: does "dual-issued pairs" really improve performance?

#1 Christophe31

  • Member
  • Group: Members
  • Posts: 17
  • Joined: 06-July 11

Posted 01 August 2011 - 08:06 AM

Hello,

Could someone help me explain this behavior?
I use a sequence of 4096 instructions (target is TMS570/Cortex-R4F):
movs r0,#1
str r0, [r8,#0]
movs r1,#2
str r1, [r8,#4]
movs r2,#3
str r2, [r8,#8]
...

When "dual-issue" mode is enabled (bits 28-31 of the Auxiliary Control Register and bits 18-20 of the Secondary Auxiliary Control Register are cleared), this code (plus a few instructions around it) executes in 5162 clock cycles.
When "dual-issue" mode is disabled (the same bits are set), this code executes in 4146 clock cycles!
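
For reference, the bit ranges quoted above translate into the masks below. This is only a minimal sketch in C, not code from the original post: the names `bit_range_mask`, `ACTLR_DUAL_ISSUE_MASK`, `SACTLR_DUAL_ISSUE_MASK`, `dual_issue_enable` and `dual_issue_disable` are made up for illustration, and the actual CP15 read/write of the two registers on the target is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with bits lo..hi (inclusive) set. Hypothetical helper. */
static uint32_t bit_range_mask(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < 32u);
    uint32_t width = hi - lo + 1u;
    uint32_t ones = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return ones << lo;
}

/* Bit ranges as quoted in the post (macro names are illustrative only). */
#define ACTLR_DUAL_ISSUE_MASK   0xF0000000u  /* bits 28-31 */
#define SACTLR_DUAL_ISSUE_MASK  0x001C0000u  /* bits 18-20 */

/* Per the post: bits cleared = dual issue enabled, bits set = disabled. */
static uint32_t dual_issue_enable(uint32_t reg, uint32_t mask)
{
    return reg & ~mask;
}

static uint32_t dual_issue_disable(uint32_t reg, uint32_t mask)
{
    return reg | mask;
}
```

The helpers operate on a plain register value, so the same read-modify-write pattern would apply to both registers once the value has been read from the core.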

I observe this phenomenon in both ARM and Thumb-2 modes.

So when "dual-issue" mode is enabled, it seems that one pipeline stage is "sometimes" (once out of 4) waiting for dual words (thus introducing extra wait states) in order to process them in pairs, but I can't find any description of this.
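
The "once out of 4" estimate can be checked against the two measurements: 5162 - 4146 = 1016 extra cycles over 4096 instructions, which is roughly one extra cycle every four instructions. The sketch below just spells out that arithmetic in C; the function names are hypothetical, and since both counts include a few bordering instructions the result is only approximate.

```c
#include <assert.h>
#include <stdint.h>

/* Extra cycles observed when dual issue is enabled. */
static uint32_t extra_cycles(uint32_t enabled, uint32_t disabled)
{
    assert(enabled >= disabled);
    return enabled - disabled;
}

/* Average number of instructions per extra cycle, rounded to nearest. */
static uint32_t instructions_per_extra_cycle(uint32_t n_instr, uint32_t extra)
{
    assert(extra != 0u);
    return (n_instr + extra / 2u) / extra;
}
```

With the figures from the post, `extra_cycles(5162, 4146)` gives 1016 and `instructions_per_extra_cycle(4096, 1016)` gives 4.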

Could someone help me understand, please? This is quite important for me, because I have to produce highly deterministic real-time software, and this kind of feature is hard to model...

Thanks for any help.

Best regards

Christophe

This post has been edited by Christophe31: 01 August 2011 - 08:55 AM


#2 Christophe31

Posted 19 January 2012 - 01:51 PM

Hello everybody,

Does no one have an answer? I am still confused by these results...

Thanks

Best regards

Christophe

#3 Chris Turner

  • Member
  • Group: Members
  • Posts: 4
  • Joined: 26-January 11

Posted 19 January 2012 - 05:21 PM

Dual issue is described in a section at the end of the processor's Technical Reference Manual (TRM), which you can download from the support documentation on the ARM web site. You will see that only certain combinations of instructions can dual-issue, and in fact this processor does not have a completely duplicated pipeline (other ARM processors have more extensive duplication of execution hardware).

Hope this helps.

#4 Christophe31

Posted 20 January 2012 - 08:01 AM

Hi Chris,

Thanks for your answer.

I know that the Cortex-R4 is "limited superscalar", but my question is: why does a particular sequence of code execute more slowly when dual issue is enabled? My understanding is that, at worst, if dual issue cannot be applied (because of the instruction sequence), the code should execute at the same speed as with dual issue disabled.

Best regards

Christophe

#5 Chris Turner

Posted 21 January 2012 - 07:17 AM

With larger 'real' code we know that dual issue is usefully faster, but this looks like a specific test case for memory accesses, and I'm guessing the load/store unit (LSU) is doing all the work and finding it harder for some reason. Let me consult some colleagues next week. In case it's a characteristic of the memory system, can you describe the hardware, please? Also, how are you measuring it?

#6 Christophe31

Posted 23 January 2012 - 10:58 AM

Hi Chris,

The hardware I use is a TI TMS570LS20216 (on a Keil MCBTMS570 evaluation board), and you're perfectly right: this is a specific memory access performance test, part of a study of deterministic behavior.

This code accesses internal SRAM (no wait states) through the BTCMs.

The executed code is located in internal flash, accessed through the ATCM.

I measure cycles with the performance monitor unit (PMU) (mrc p15, #0, r0, c9, c13, #0).
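
A measurement along these lines could be wrapped as in the following sketch. It is not the code actually used for the figures in this thread: the names `read_cycle_counter`, `cycles_elapsed` and `measure` are hypothetical, the counter is assumed to be already enabled and read from a privileged mode, and on a non-ARM host the read is stubbed to return 0 so that the file still compiles.

```c
#include <assert.h>
#include <stdint.h>

/* Read the PMU cycle counter with the same CP15 access as in the post
 * (c9, c13, 0). Stubbed to return 0 when not compiled for 32-bit ARM. */
static inline uint32_t read_cycle_counter(void)
{
#if defined(__arm__)
    uint32_t value;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
#else
    return 0u;
#endif
}

/* Elapsed cycles; unsigned subtraction stays correct across one
 * wrap of the 32-bit counter. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Time one call of fn() in cycles (includes the call overhead). */
static uint32_t measure(void (*fn)(void))
{
    assert(fn != 0);
    uint32_t start = read_cycle_counter();
    fn();
    uint32_t end = read_cycle_counter();
    return cycles_elapsed(start, end);
}
```

Because the two counter reads and the call itself are inside the measured window, the result includes a small fixed overhead, which matches the "plus a few instructions" caveat on the reported figures.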

Thanks for helping

Best regards

Christophe

#7 Chris Turner

Posted 24 January 2012 - 05:23 PM

Hello again. We've taken a quick look at this in a processor simulation, and running your code there does show the expected dual-issue benefit. We suspect that what you're seeing is due to the combination of rapid stores and instruction fetches somehow impacting the memory interface, flash pre-fetch, etc. in the device you are using when dual issue is working it hard. However, I'm sure you will see the benefit of dual issue on a real piece of code. I hope this helps. Regards, Chris

#8 Christophe31

Posted 27 January 2012 - 08:58 AM

Hello Chris,

Thanks for your help, I will try to get support from TI...

Note that I have already seen the benefit of dual issue on real pieces of code. I do not dispute this, but it is not what I am focused on. My goal is to predict the processor's behavior in any situation for hard real-time avionics applications, for which determinism is the crucial leitmotiv.

Best regards

Christophe

#9 Chris Turner

Posted 27 January 2012 - 10:34 AM

Yes, see what they say about it. In closing, let me mention that predicting precise cycle counts for processors like Cortex-R4 is not an exact science because there are heuristics in the branch prediction and behaviours in store buffers and the like that may cause slight variations. However, you should find that real-time performance remains adequately deterministic thanks to this processor's fast interrupt entry mode in the pipeline, reduction of interrupt entry dependency on queued memory transactions, availability of TCM to store critical code and data without dependency on the main L1/L2 memory system and external bus, and the absence of any MMU that would trigger TLB misses, page table walks etc.

With best regards, Chris
