ARM Community: Cortex A8 preload engine (PLE) error - ARM Community

Cortex A8 preload engine (PLE) error
I'm getting an error when loading data into the L2 cache with the ...

#1 User is offline   TedM 

  • Member
  • Pip
  • Group: Members
  • Posts: 8
  • Joined: 01-June 11

Posted 24 November 2011 - 06:57 PM

I have a user-mode Linux application running on a Cortex-A8 (a TI 8148 DaVinci chip). I have a shared memory region that I'm using to communicate data back and forth between the ARM core and the TI c674x DSP. The shared memory region is a ring buffer made of 32 KB segments (the size of one of the 8148's L2 cache ways). I've locked down 3 of the L2 cache ways and I'm trying to use the L2 PLE (preload engine) - the L2 feature accessed through coprocessor 15 c11 - to asynchronously preload and write back the ring buffer segments. The ring buffer itself is located in physically and virtually contiguous memory - we're using TI's cmem module to allocate out of a memory hole. Moreover, I've checked the Linux struct page flags for the ring buffer pages and they all seem to be uniform and fairly kosher. Plain-vanilla loads and stores from the ring buffer work just fine, as do coprocessor 15 based cache writeback operations (performed in privileged mode, of course).
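As a rough illustration of the layout described above, the sketch below computes where each 32 KB segment of the ring starts and how a segment index wraps around. Only the 32 KB segment size comes from the post; the segment count, the base address, and the names `ring_t`, `seg_addr` and `next_seg` are assumptions made for the example.

```c
#include <stdint.h>

#define SEG_SIZE (32u * 1024u)   /* one L2 way on the 8148, per the post */

/* Hypothetical descriptor for the shared ring buffer. */
typedef struct {
    uintptr_t base;    /* start of the contiguous region (assumed segment-aligned) */
    unsigned  nsegs;   /* number of 32 KB segments (assumed, not from the post) */
} ring_t;

/* Address of segment idx; indices past the end wrap to the start. */
static uintptr_t seg_addr(const ring_t *r, unsigned idx)
{
    return r->base + (uintptr_t)(idx % r->nsegs) * SEG_SIZE;
}

/* Index of the segment that follows idx in ring order. */
static unsigned next_seg(const ring_t *r, unsigned idx)
{
    return (idx + 1u) % r->nsegs;
}
```

Because every segment is a multiple of 4 KB long and the base is assumed aligned, each segment starts on a page boundary and spans exactly eight pages.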

Anyway, everything goes quite nicely for a while (anywhere from 3 to 10 PLE transfers complete successfully), until a PLE transfer errors out at a page boundary. It's a different page boundary (both virtual and physical address) each time, and the failure happens a different number of ring buffer segments in, and a different number of pages into the segment, each time. The error itself, from table 3-132 in the ARM Cortex-A8 Technical Reference Manual, is "b1000101", or "translation fault, section".

Does anyone know what this error means? At first I thought that maybe it was because the page was marked as uncached, but looking at the page properties (with /proc/kpageflags), that doesn't seem to be the case.

Edit: One more detail - this failure only happens with preload operations - not writebacks. Or at least I haven't seen it happen with a writeback yet.

This post has been edited by TedM: 24 November 2011 - 11:18 PM

0

#2 User is offline   isogen74 

  • Super Contributor
  • PipPipPipPip
  • Group: Members
  • Posts: 1097
  • Joined: 20-March 07

Posted 25 November 2011 - 04:40 PM

My guess is that you set the PLE running on one set of virtual addresses, and then your OS context switches, and the CPU page tables change. The VA the PLE is trying to use doesn't exist in the new process's address map. It will always fail at the start of a page or PLE range, as this is the first time it will see a translation fault from the MMU.
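A toy model of that guess, under stated assumptions: the engine re-translates at every 4 KB boundary using whichever page table is current, and the "OS" swaps tables partway through. The single-level table, the eight-page address space and the names `toy_pt`, `toy_transfer` and `fault_va` are all invented for the illustration; this is not how the hardware or Linux represents any of it.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u
#define NPAGES    8u
#define UNMAPPED  UINT32_MAX     /* marker for "no mapping" in this model */

/* Toy single-level page table: virtual page number -> physical page number. */
typedef struct { uint32_t ppn[NPAGES]; } toy_pt;

/*
 * Walk npages pages starting at first_vpn, translating at each page
 * boundary with the current table. After switch_after pages the table
 * changes from `mine` to `other`. Returns the number of pages completed;
 * fewer than npages means a translation fault stopped the transfer.
 */
static unsigned toy_transfer(const toy_pt *mine, const toy_pt *other,
                             unsigned first_vpn, unsigned npages,
                             unsigned switch_after)
{
    const toy_pt *cur = mine;
    for (unsigned i = 0; i < npages; i++) {
        if (i == switch_after)
            cur = other;                       /* simulated context switch */
        unsigned vpn = first_vpn + i;
        if (vpn >= NPAGES || cur->ppn[vpn] == UNMAPPED)
            return i;                          /* fault at a page boundary */
    }
    return npages;
}

/* Faulting virtual address for a transfer that stopped early. */
static uint32_t fault_va(unsigned first_vpn, unsigned pages_done)
{
    return (uint32_t)(first_vpn + pages_done) * PAGE_SIZE;
}
```

In this model the fault address is always page-aligned, and where it lands depends only on when the switch happened, which would match a failure that moves around from run to run.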

HTH,
Iso
When optimizing software, consider that the quickest code to run is the bit you removed from the call path.
2

#3 User is offline   TedM 

  • Member
  • Pip
  • Group: Members
  • Posts: 8
  • Joined: 01-June 11

Posted 25 November 2011 - 06:01 PM

isogen74, on 25 November 2011 - 04:40 PM, said:

My guess is that you set the PLE running on one set of virtual addresses, and then your OS context switches, and the CPU page tables change. The VA the PLE is trying to use doesn't exist in the new process's address map. It will always fail at the start of a page or PLE range, as this is the first time it will see a translation fault from the MMU.

HTH,
Iso


Ah - this makes perfect sense - thank you. That should mean that restarting the transfer when I see the error status should be harmless, no?

This raises another question, though. I'm using the PLE and I'm seeing a small amount of corruption in the data that makes it to the DSP. I'm wondering if maybe Linux is switching to another process which coincidentally DOES have that same VA mapped, and my L2 cache data is getting written out to the wrong place (i.e. a page in that other process)? What's supposed to prevent this from happening? Should I be setting the PLE Context ID register (MCR p15, 0, <Rd>, c11, c15, 0) to something meaningful?
1

#4 User is offline   isogen74 

  • Super Contributor
  • PipPipPipPip
  • Group: Members
  • Posts: 1097
  • Joined: 20-March 07

Posted 25 November 2011 - 07:00 PM

Quote

I'm wondering if maybe Linux is switching to another process which coincidentally DOES have that same VA mapped, and my L2 cache data is getting written out to the wrong place (ie a page in that other process)?

Yes, corruption would certainly result if the VA->PA translation changed to something else and the PLE was still running.

Quote

What's supposed to prevent this from happening?


I suspect the answer is "software" =)

I'm not a PLE expert, but AFAICT the PLE uses the same page tables as currently mapped on the core, so if the OS context switches from one process to another you have to either (1) stall the context switch waiting for the pending PLE requests to complete, (2) cancel pending PLE requests, or (3) "pause" the transfer, switch the process out, and "resume" when it gets switched back in again.

Cheers,
Iso

This post has been edited by isogen74: 25 November 2011 - 07:06 PM

When optimizing software, consider that the quickest code to run is the bit you removed from the call path.
2

#5 User is offline   Jerry Fan 

  • Contributor
  • PipPip
  • Group: Members
  • Posts: 56
  • Joined: 10-January 11

Posted 26 November 2011 - 06:44 AM

Or you can allocate the shared memory in kernel space, since the kernel space shares the same MMU table entries across every Linux process.
1

#6 User is offline   TedM 

  • Member
  • Pip
  • Group: Members
  • Posts: 8
  • Joined: 01-June 11

Posted 28 November 2011 - 11:03 PM

 isogen74, on 25 November 2011 - 07:00 PM, said:


Yes, corruption would certainly result if the VA->PA translation changed to something else and the PLE was still running.



I suspect the answer is "software" =)

I'm not a PLE expert, but AFAICT the PLE uses the same page tables as currently mapped on the core, so if the OS context switches from one process to another you have to either (1) stall the context switch waiting for the pending PLE requests to complete, (2) cancel pending PLE requests, or (3) "pause" the transfer, switch the process out, and "resume" when it gets switched back in again.

Cheers,
Iso



I wonder - I've played around with this a bit and it seems that the PLE ContextID register might be the key here. I suspect that the ASID field in that register needs to match the ASID field in any TLB entries used by the PLE to do its address translation. With ARMv7 apparently the ASID is part of the TLB lookup - if the current contents of the global ContextID register (c13, c0) don't match the TLB ASID, then the TLB entry won't be a match. It seems like maybe the PLE ContextID register (c11, c15) might serve a similar purpose for these asynchronous PLE transfers.
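For reference, the ordinary ARMv7 matching rule mentioned here can be sketched as below: a non-global TLB entry hits only when its ASID equals the ASID held in the low 8 bits of the Context ID register. The struct layout and the names `tlb_entry`, `contextidr_asid` and `tlb_hit` are simplifications invented for the example, and whether the PLE Context ID register takes part in any comparison like this is the speculation above, not something this sketch establishes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy TLB entry: only the fields needed to show ASID tagging. */
typedef struct {
    uint32_t vpn;      /* virtual page number */
    uint8_t  asid;     /* ASID the entry was loaded under */
    bool     global;   /* global entries ignore the ASID */
    bool     valid;
} tlb_entry;

/* The ASID is the low 8 bits of the Context ID register value. */
static uint8_t contextidr_asid(uint32_t contextidr)
{
    return (uint8_t)(contextidr & 0xFFu);
}

/* An entry hits if the page matches and it is global or the ASIDs agree. */
static bool tlb_hit(const tlb_entry *e, uint32_t vpn, uint32_t contextidr)
{
    return e->valid && e->vpn == vpn &&
           (e->global || e->asid == contextidr_asid(contextidr));
}
```

Note that only the low 8 bits matter for the match; the upper 24 bits of the Context ID value are ignored by this rule.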

Unfortunately, Linux seems to change the ASID whenever it rolls over to 0 (it's an 8-bit counter) - so I'm not sure that I could guarantee that my process's ASID is always going to be the same? If not, I'd have to set the PLE ContextID ASID often enough to do reliable transfers - and the PLE ContextID register is only accessible in kernel-mode. One of the big reasons I'm trying to use the PLE in the first place is to avoid an expensive syscall when writing back memory - it's fairly expensive on this platform (about 8000 cycles for a binary sysfs attribute access, and more for an ioctl or a character sysfs attribute access).


The real problem that I'm having now seems to be writing the L1 cache back - I've figured out that most (all?) of the corruption I'm seeing now is due to writing the L2 cache back with the PLE but not the L1 cache.
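The ordering problem can be seen in a one-word toy model of a write-back hierarchy: if only the L2 is cleaned, a value still sitting dirty in L1 never reaches memory. The model and the names `toy_mem`, `cpu_store`, `clean_l1` and `clean_l2` are assumptions for illustration; real caches work on lines, and the Cortex-A8's L1/L2 interaction is more involved than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy write-back hierarchy holding a single word. */
typedef struct {
    uint32_t l1, l2, dram;
    bool l1_dirty, l2_dirty;
} toy_mem;

/* A CPU store lands in L1 and marks it dirty. */
static void cpu_store(toy_mem *m, uint32_t v)
{
    m->l1 = v;
    m->l1_dirty = true;
}

/* Clean L1: push the dirty word down into L2. */
static void clean_l1(toy_mem *m)
{
    if (m->l1_dirty) {
        m->l2 = m->l1;
        m->l2_dirty = true;
        m->l1_dirty = false;
    }
}

/* Clean L2 only: L1 is not consulted, so dirty L1 data stays behind. */
static void clean_l2(toy_mem *m)
{
    if (m->l2_dirty) {
        m->dram = m->l2;
        m->l2_dirty = false;
    }
}
```

Calling `clean_l2` alone after a store leaves `dram` stale in this model; calling `clean_l1` first and then `clean_l2` makes the new value visible to anything reading memory directly, such as the DSP.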

This post has been edited by TedM: 29 November 2011 - 12:23 AM

0

#7 User is offline   TedM 

  • Member
  • Pip
  • Group: Members
  • Posts: 8
  • Joined: 01-June 11

Posted 29 November 2011 - 12:25 AM

 Jerry Fan, on 26 November 2011 - 06:44 AM, said:

Or you can allocate the shared memory in the kernel space, since for every Linux process, the kernel space shared the same MMU table entries.


Yes - I considered this too, but I'm not sure my Linux kernel-fu is quite advanced enough yet to accomplish this. We've been using a /dev/mem-like tool that TI provides called cmem to map contiguous memory chunks, and it doesn't provide the ability to use kernel logical mappings - just user mappings.

This post has been edited by TedM: 29 November 2011 - 12:26 AM

0

#8 User is offline   isogen74 

  • Super Contributor
  • PipPipPipPip
  • Group: Members
  • Posts: 1097
  • Joined: 20-March 07

Posted 29 November 2011 - 09:30 PM

Quote

With ARMv7 apparently the ASID is part of the TLB lookup - if the current contents of the global ContextID register (c13, c0) don't match the TLB ASID, then the TLB entry won't be a match.


Yes, the aim of the ASID is so that you don't have to flush the TLB on context switch. What I am unclear on is what happens when you get a TLB miss when the PLE is running. I assume it would perform a table walk using the current page tables, but populated with the ASID value out of the ContextID register. Which probably isn't what you wanted it to do (I guess you would want it to stop on an ASID mismatch for your use case).


Quote

Unfortunately, Linux seems to change the ASID whenever it rolls over to 0


Yes, that's the other issue. If you have more than 255 processes active at the same time you will get ASID rollover, so it is time variant.
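To make the "time variant" point concrete, here is a toy allocator loosely modelled on a generation-counter scheme: the low 8 bits of a counter are the ASID and the upper bits count rollovers, so a process whose generation is stale gets a fresh ASID the next time it runs. This is a simplified sketch, not the kernel's code; `asid_alloc`, `toy_mm` and `asid_for` are invented names, and the real implementation also has to flush TLBs and coordinate between CPUs.

```c
#include <stdint.h>

#define ASID_BITS 8u
#define ASID_MASK ((1u << ASID_BITS) - 1u)

/* Toy allocator: low 8 bits are the ASID, upper bits a generation count. */
typedef struct { uint32_t last; } asid_alloc;

/* Per-process record of the value it was last given (0 = none yet). */
typedef struct { uint32_t ctx; } toy_mm;

/*
 * Return the ASID to run `mm` with. If the process has no ASID yet, or
 * its generation is stale because the counter wrapped since it last ran,
 * hand out a fresh one.
 */
static uint8_t asid_for(asid_alloc *a, toy_mm *mm)
{
    if (mm->ctx == 0 || ((mm->ctx ^ a->last) >> ASID_BITS) != 0) {
        a->last++;
        if ((a->last & ASID_MASK) == 0)
            a->last++;          /* skip ASID 0; this starts a new generation */
        mm->ctx = a->last;
    }
    return (uint8_t)(mm->ctx & ASID_MASK);
}
```

In this model a process keeps its ASID for as long as the counter stays in the same generation, but once enough other processes have been given ASIDs to wrap the 8-bit field, its next call returns a different value. Anything that cached the old ASID, such as a value written once into another register, would then be out of date.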

I think Jerry is on the right lines here; the usual approach to exposing this type of hardware is to provide a device driver, so user-space allocates the memory via a kernel call to the driver, and performs special operations (start PLE transfer, for example) via a kernel call to the driver. This allows the kernel to have the memory mapped in its address space, which solves the changing page-table problem, and you will need the kernel calls at the start and end of each PLE operation as you will need to issue appropriate L1 cache operations to ensure visibility of the data you've just shovelled into / want to shovel out of the L2.

Cheers,
Iso

This post has been edited by isogen74: 29 November 2011 - 09:31 PM

When optimizing software, consider that the quickest code to run is the bit you removed from the call path.
0

#9 User is offline   Exophase 

  • Regular Contributor
  • PipPipPip
  • Group: Members
  • Posts: 118
  • Joined: 20-July 10

Posted 30 November 2011 - 05:11 PM

isogen74, on 29 November 2011 - 09:30 PM, said:

Yes, the aim of the ASID is so that you don't have to flush the TLB on context switch. What I am unclear on is what happens when you get a TLB miss when the PLE is running. I assume it would perform a table walk using the current page tables, but populated with the ASID value out of the ContextID register. Which probably isn't what you wanted it to do (I guess you would want it to stop on an ASID mismatch for your use case).


According to section 8.4.1 of the TRM, the PLE doesn't use the TLB and always walks the page table directly at the start of a transfer and at each 4KB boundary. This should mean that the PLE Context ID register is compared against the global Context ID register, which gives the comparison an entire 32 bits and should avoid potential aliasing. I would guess that you should be setting the PLE Context ID register to the current value of the Context ID register. It could be that the PLE doesn't bother checking equivalence for the first page, which would explain why you're succeeding until the second page is hit. This would make sense since the first table walk is done before any data is transferred and is then valid for the entire page regardless of whether or not a context switch occurs, and this first table walk may have to succeed before the start operation can finish.

Section 8.4.5 does claim that if the Context ID register changes during a PLE operation the result is unpredictable. You would think they're really referring to the PLE Context ID register, since otherwise I don't understand the point of having it in the first place (if the current process's one can't change).
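A minimal sketch of the suggestion above, copying the current Context ID into the PLE Context ID register before starting a transfer. The write uses the encoding quoted earlier in the thread (c11, c15, 0) and the read uses the standard ARMv7 Context ID register encoding (c13, c0, 1); both are privileged-only, so this would have to run in kernel mode. The `PLE_USE_CP15` macro, the function names and the host-side stand-in variables are assumptions added so the code can be compiled anywhere, and whether this actually cures the fault is a guess that has not been tested.

```c
#include <stdint.h>

#if defined(__arm__) && defined(PLE_USE_CP15)
/* Real accessors: privileged mode only; they fault from user space. */
static inline uint32_t read_contextidr(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c13, c0, 1" : "=r"(v));
    return v;
}

static inline void write_ple_contextid(uint32_t v)
{
    /* Encoding as quoted earlier in the thread: c11, c15, 0 */
    __asm__ volatile("mcr p15, 0, %0, c11, c15, 0" : : "r"(v));
}
#else
/* Host-side stand-ins (hypothetical) so the logic can be exercised off-target. */
static uint32_t fake_contextidr, fake_ple_contextid;

static inline uint32_t read_contextidr(void)
{
    return fake_contextidr;
}

static inline void write_ple_contextid(uint32_t v)
{
    fake_ple_contextid = v;
}
#endif

/* Make the PLE Context ID match the current process before a transfer. */
static inline void ple_sync_contextid(void)
{
    write_ple_contextid(read_contextidr());
}
```

Since the Context ID can change across a context switch, a call like this would need to happen each time a transfer is started rather than once at setup.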
0
