ARM Community: Cortex DDR3 memory performance - ARM Community


Cortex DDR3 memory performance

#1 User is offline   webshaker 

  • Regular Contributor
  • Group: Members
  • Posts: 220
  • Joined: 07-October 10

Posted 11 January 2011 - 10:55 AM

Hi.

Boards like the BeagleBoard or PandaBoard have very poor memory access speed!
Do you know if this is due to the LPDDR memory?

Do servers like the ZT Systems R1801e get better memory performance with DDR3 memory?

I don't want to (I can't, to be exact ;)) buy a $20,000 server!
Is there a motherboard (with only one Cortex-A8) with DDR3 memory on which I could test memory performance?

Thanks
When you have eliminated the impossible, whatever remains, however improbable, must be the truth

#2 User is offline   isogen74 

  • Super Contributor
  • Group: Members
  • Posts: 1098
  • Joined: 20-March 07

Posted 11 January 2011 - 06:18 PM

What do you call slow access speed, and what are you comparing it to?

Side note - it is worth noting that CPU and memory system performance varies from device manufacturer to device manufacturer - each makes their own trade off in terms of device silicon area, power, etc - so a Cortex-A8 device from one vendor may have much faster memory / faster core / more power budget / etc than another vendor. It's worth shopping around to see what is available.

I'm guessing you are talking about memory latency (CPU cycles) from CPU to external memory?

>> Do you know if it's due to LPDDR memory ?

It's a combination of factors - but all with the same root cause (the aim to have lower power). Essentially LPDDR makes three sacrifices to achieve lower power consumption:

  • Lower voltage - LPDDR typically operates at 1.8 volts rather than 2.5 for other typical mobile electronics - which means transistors take longer to switch because they are being driven less hard. This results in ...
  • Lower frequency or higher latency. If transistors are not switching as fast you have two choices - drop the clock rate for the same latency, or keep the clock rate but increase pipelining which increases latency. The latter is quite common - so CPU visible access latency goes up.
  • Fewer I/O pins. 16-bit and 32-bit LPDDR interfaces are not uncommon, so transferring a 32-byte cache line takes many more memory cycles than with a wide non-LP RAM, which is typically 64 bits wide. This added latency comes on top of the increased pipeline length described above.
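
To put rough numbers on the bus-width point, here is a minimal sketch in C that counts how many data-bus transfers (beats) one cache line needs for a given bus width. The function name `beats_per_line` is made up for illustration, and the sketch ignores command overhead, burst rules and everything else a real memory controller does. Since DDR-type memories move two beats per clock, the clock-cycle count would be about half the beat count.

```c
#include <assert.h>
#include <stddef.h>

/* Number of data-bus transfers (beats) needed to move one cache line.
 * line_bytes: cache line size in bytes (32 in the example above).
 * bus_bits:   width of the memory data bus in bits (e.g. 16, 32 or 64).
 * Hypothetical helper for illustration only. */
size_t beats_per_line(size_t line_bytes, size_t bus_bits)
{
    assert(bus_bits > 0 && bus_bits % 8 == 0);
    size_t bus_bytes = bus_bits / 8;
    /* Round up so a partial final transfer still counts as one beat. */
    return (line_bytes + bus_bytes - 1) / bus_bytes;
}
```

For a 32-byte line this gives 16 beats on a 16-bit bus, 8 on a 32-bit bus and 4 on a 64-bit bus, so the narrowest interface needs four times as many transfers as the widest one.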

The same tradeoffs are made inside the CPU with the bus which links the CPU to the memory controller - so that also adds latency - as does the fact that the CPU is clocked significantly faster than the memory system.

If you are willing to spend more power to get increased memory system performance you may want to look at ARM cores designed for non-mobile applications such as network appliances - these are commonly powered from the mains, so their designs are speed-optimized rather than power-optimized as they don't have to worry about preserving a battery =)

The Shiva Plug is based on a Marvell ARM core (http://www.highsecla...lug_sp1100.html) using standard DDR2 memories and has very low latency to memory - but burns significantly more power (quite a few multiples more power) than the OMAP series of chips you mentioned. Also note that DDR3 is a higher latency, but higher bandwidth memory than DDR2. If you are measuring latency from the CPU then DDR3 may well be slower than you expect if you are used to DDR2 numbers.

This post has been edited by isogen74: 11 January 2011 - 06:27 PM

When optimizing software, consider that the quickest code to run is the bit you removed from the call path.

#3 User is offline   webshaker 

  • Regular Contributor
  • Group: Members
  • Posts: 220
  • Joined: 07-October 10

Posted 12 January 2011 - 08:51 AM

Thank you.
I think that answers my question.

In fact, I'd like to use Cortex-A8 (or A9) processors in a server farm to run a
background task (using NEON), for experimental purposes for the moment.

I have tested the BeagleBoard and the PandaBoard.
The test was JPEG file decompression.
For the moment it seems that a dual quad-core 3 GHz Xeon (coded in C) is faster than 16 PandaBoards with dual-core 1 GHz Cortex-A9s (partially coded in assembler).

The main problem is memory access.
The CPU is fast enough, and we could imagine a big Cortex server farm if we can solve our memory access problem.

The other problem is that most ARM processors do not include NEON.
This is the case for the SPEAr1310 used by ZT Systems.

So... I'm looking for a Cortex motherboard:
- small enough to be installed in a 1U server rack
- with NEON
- and with low memory access latency

Etienne

#4 User is offline   isogen74 

  • Super Contributor
  • Group: Members
  • Posts: 1098
  • Joined: 20-March 07

Posted 12 January 2011 - 06:07 PM

One thing to remember when directly porting code (especially SIMD data-plane stuff) is that typical x86 server chips have huge L2/L3 caches - you can quite easily keep 16 or even 32MB of data within 20 cycles of the core with practically no effort on behalf of the programmer. For a mobile device we can't easily do this with our power and silicon area budget, so the L2 cache is typically a lot smaller - somewhere between 256KB and 512KB per core in a multi-core system would not be uncommon.

If you stray outside of the dataset held in this smaller L2 you start incurring round-trip latency to main memory to fetch things to put in the cache. So ... if you are doing data-plane tasks using NEON you should look at making aggressive use of explicit data preload (PLD) or the A9 programmable prefetch engine to load data into the L2 while you are processing previous data blocks. I've not looked at OMAP4 at all, but for ARM devices the round-trip latency to memory is often comparable to desktop chips (desktop chips are around 120 cycles latency to external memory; I've seen 1GHz ARM devices from various manufacturers range from ~65 cycles to ~250 cycles), so the only real difference is the smaller L2 cache. If you can preload data early (i.e. at least <latency> CPU cycles before you need it) you can hide the overheads caused by cache misses, provided you have enough bandwidth in the memory system for the volume of data you are loading.
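
As a rough illustration of "preload at least <latency> cycles early", here is a minimal sketch in C. It assumes GCC or Clang, whose `__builtin_prefetch` builtin is generally compiled to a PLD on ARM targets. The function names, the 64-byte block size and the byte-sum workload are all made up for illustration; the latency and per-block cycle counts are inputs you would have to measure on your own device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed block size in bytes; pick whatever your inner loop consumes. */
enum { BLOCK_BYTES = 64 };

/* How many blocks ahead to prefetch so the data arrives before it is used.
 * latency_cycles:   measured round-trip latency to external memory.
 * cycles_per_block: measured CPU cycles spent processing one block. */
size_t prefetch_distance_blocks(size_t latency_cycles, size_t cycles_per_block)
{
    assert(cycles_per_block > 0);
    /* Round up: prefetching slightly too early is cheaper than too late. */
    return (latency_cycles + cycles_per_block - 1) / cycles_per_block;
}

/* Stand-in workload: sum every byte of n_blocks blocks, issuing a prefetch
 * for the block that is `ahead` blocks in front of the one being processed. */
uint32_t sum_with_prefetch(const uint8_t *src, size_t n_blocks, size_t ahead)
{
    uint32_t sum = 0;
    for (size_t b = 0; b < n_blocks; ++b) {
        if (b + ahead < n_blocks) {
            /* Arguments: address, 0 = read access, 0 = low temporal locality. */
            __builtin_prefetch(src + (b + ahead) * BLOCK_BYTES, 0, 0);
        }
        const uint8_t *p = src + b * BLOCK_BYTES;
        for (size_t i = 0; i < BLOCK_BYTES; ++i) {
            sum += p[i];
        }
    }
    return sum;
}
```

With a 120-cycle latency and 40 cycles of work per block, the helper suggests prefetching 3 blocks ahead; at 250 cycles it suggests 7. The prefetch is only a hint, so the result of the loop is the same with or without it.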

Iso

#5 User is offline   webshaker 

  • Regular Contributor
  • Group: Members
  • Posts: 220
  • Joined: 07-October 10

Posted 13 January 2011 - 08:39 AM

I've tried PLD.

It produces a strange result.
For example, for the YUV to RGB conversion, I wrote something like this:

Quote

PLD [r0, #192]
PLD [r1, #192]
PLD [r2, #192]
vld1.8 {q0}, [r0]! @ 16 Y component
vld1.8 {q1}, [r1]! @ 16 U component
vld1.8 {q2}, [r2]! @ 16 V component
--- do the conversion ---
vst3.8 ... @ Save 8 interleaved pixels
vst3.8 ... @ Save 8 interleaved pixels


After benchmarking, it seems that the first buffer was pretty fast while the others (r1 and r2) were still very slow.

So I changed the storage layout.
I interleaved 16 Y components, followed by 16 U and then 16 V.

I used the same code but with only one pointer register.

Quote

PLD [r0, #192]
vld1.8 {q0}, [r0]! @ 16 Y component (use r0)
vld1.8 {q1}, [r0]! @ 16 U component (use r0)
vld1.8 {q2}, [r0]! @ 16 V component (use r0)


And this code is much faster.

So I suppose that there is only one line buffer for the PLD instruction!
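
For reference, the repacking into a single stream can be sketched in C roughly as follows. This is only an illustration of the layout (16 Y, then 16 U, then 16 V, repeated), assuming one U and one V sample per Y sample as in the loads above. The function name `interleave_yuv_groups` is made up, and this is not the code used in the benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of samples per group, matching one 16-byte q-register load. */
enum { GROUP = 16 };

/* Repack three planar buffers into one stream of [16 Y][16 U][16 V] groups,
 * so a single pointer (and a single PLD stream) walks all the input data.
 * n is the sample count per plane and must be a multiple of 16;
 * out must have room for 3 * n bytes. */
void interleave_yuv_groups(const uint8_t *y, const uint8_t *u, const uint8_t *v,
                           size_t n, uint8_t *out)
{
    assert(n % GROUP == 0);
    for (size_t i = 0; i < n; i += GROUP) {
        memcpy(out,             y + i, GROUP);  /* 16 Y components */
        memcpy(out + GROUP,     u + i, GROUP);  /* 16 U components */
        memcpy(out + 2 * GROUP, v + i, GROUP);  /* 16 V components */
        out += 3 * GROUP;
    }
}
```

The repacking itself costs a pass over memory, so it only pays off if the data can be produced in this layout in the first place (for example by the previous decoding stage).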

The stores (vst3) are always very slow. I haven't found anything that improves their performance.
I still have a lot of work to do to understand the Cortex-A8, but it is a very interesting processor.
I really hope we'll be able to use it for our project.
