Memory handling

Internally, the memory handling in RISC OS has not changed much since RISC OS 2, and probably not since Arthur. The mappings that the system maintains are still based on those that were exposed by the memory mapping tables in the original ARM 2 systems. Whilst the page size has decreased (from its variable size of up to 32K to just 4K pages), much of the access to the page tables still follows the same code paths.

I was beginning to change the internal memory handling so that it could cope more efficiently with page re-mappings, particularly for cache systems that differed from those the system had grown up with. The early chips that we still had to support operated their caches on logical addresses, which meant that on modifying the page tables the cache had to be flushed, or the cached data would be incorrect. Later chips - ARMv6 and beyond - maintained coherency on physical addresses. This would have significantly reduced the amount of work necessary for page re-mappings and meant that a lot of operations became significantly faster - in theory, the Lazy Task Swapping changes might not even be necessary, as the performance of task switching would improve. On the other hand, Lazy Task Swapping did improve speed for very large task slots - the trade off would have to be evaluated when such systems were in use.

Page mapping deferral

Some of the work to improve performance for these newer processors was done - moving code around and making the operations clearer. However, as the vast bulk of the support was needed for older systems, the current processor behaviour needed to be fast. One particularly fun change was to the dynamic area handling. Dynamic areas, on resize, would go through the usual page remapping procedures, and cause cache flushes to ensure coherency. When you resized an area by a large amount, this resulted in a lot of cache flushes - effectively making the system uncached for the duration of the resize operation.

This might not seem too bad, but with the other things that the page remapping did, it meant that resizing a Dynamic Area from 200M to 0K (for example, deleting it) took a significant length of time. On Adjust this took about 230 centiseconds. Other improvements had reduced this to just 82 centiseconds on Select 4 - a nice improvement, but when we're just removing memory mappings it still isn't really reasonable. The solution I used - deferring coherency until the end of the operation for certain cases - reduced this time to just 11 centiseconds, which wasn't wonderful, but was certainly acceptable given how slow it had been.
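
In outline, the change looks something like this - a minimal sketch in C with illustrative names, not the Kernel's actual routines (the real code is ARM assembler manipulating the MMU tables directly):

    #include <stddef.h>

    #define PAGE_SIZE 4096u

    /* Illustrative low-level operations - not the Kernel's real names */
    extern void pagetable_unmap(unsigned long logical); /* rewrite one entry */
    extern void cache_flush_all(void);  /* clean/invalidate caches and TLB */

    volatile int remap_in_progress = 0; /* consulted by the IRQ path, below */

    /* Old behaviour: full coherency after every page, so deleting a
     * 200M area meant tens of thousands of flushes */
    void unmap_region_eager(unsigned long base, size_t pages)
    {
        for (size_t i = 0; i < pages; i++) {
            pagetable_unmap(base + i * PAGE_SIZE);
            cache_flush_all();
        }
    }

    /* Deferred behaviour: rewrite all the entries, then flush once */
    void unmap_region_deferred(unsigned long base, size_t pages)
    {
        remap_in_progress = 1;
        for (size_t i = 0; i < pages; i++)
            pagetable_unmap(base + i * PAGE_SIZE);
        cache_flush_all();            /* single flush restores coherency */
        remap_in_progress = 0;
    }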

The deferral also meant that the IRQ handling routines needed to be aware of the change, so that they saw a coherent view of memory. Under the normal system, the page tables are always consistent, even after a single page change. Under the deferral system, the page tables in use won't match those in memory, because parts of them are still cached. Any cached entries which would be written to the remapped memory addresses don't matter, because those addresses are being mapped out. At the end of the mapping change, the page tables and caches are flushed to bring everything up to date. But when an IRQ occurs during the remapping process, the cache and page tables are not consistent, so they need to be flushed to ensure that the memory map in use, and that which is expected by the system, are the same. FIQs could also be affected by this, but FIQs cannot trigger SWIs which might need to manipulate the memory, and they would never be operating on the memory that was being unmapped (except in the case of a fault - in which case their behaviour would be just as non-deterministic as it would have been without the deferral in place).
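
Continuing the sketch above, the IRQ entry path only needs a cheap check and an occasional flush (again illustrative C; the real handler is assembler):

    extern volatile int remap_in_progress; /* set by the deferred unmap */
    extern void cache_flush_all(void);
    extern void irq_dispatch(void);        /* stand-in for normal handling */

    void irq_entry(void)
    {
        if (remap_in_progress) {
            /* The map in use and the tables in memory may disagree;
             * flush so that this IRQ sees a coherent view.  The flag
             * stays set: the remap is still mid-flight, so any later
             * IRQ must do the same until the final flush clears it. */
            cache_flush_all();
        }
        irq_dispatch();
    }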

The addition of flushes on IRQs wasn't a huge deal, as the speed up meant that it wouldn't matter most of the time. In the above case of a large Dynamic Area being deleted - 200M is 51200 4K pages, so previously one flush per page - it reduced the number of cache flushes from 51200 to 12 (assuming only timers and the final flush were triggers).

These deferrals would be completely redundant on modern processors, but the gains were significant on the extant systems.

Really, a lot of the way in which memory was handled in the Kernel needed to be ripped out and rethought. Larger memory maps and differing paging and caching strategies (remember that much of the memory system is the same as it was before there were any caches, never mind separate instruction and data caches) meant that the implementation and interfaces to the memory system were really not suitable. Much of this was to be reviewed in a later release, but that never came about.

Lazy Task Swapping

In RISC OS 4, Acorn had introduced an improvement to the task swapping of applications. In earlier versions of the Operating System, each task would have the entire application task area paged in and out with each task swap. This could get very expensive in terms of the number of page mappings that were done. Lazy Task Swapping improved this by leaving the pages in the task workspace aborting when the task was switched in. Each access to a page which caused an abort would cause the page table entry to be updated to contain the correct entry for the task. When the task was paged out, only the small number of pages which had been changed would need updating.
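
A conceptual sketch of the scheme, in C with invented names (the real implementation lives in the Kernel's page table and abort handling code):

    #include <stddef.h>

    #define PAGE_SIZE 4096u

    extern void pagetable_set_aborting(unsigned long logical);
    extern void pagetable_map(unsigned long logical, unsigned long physical);

    typedef struct {
        size_t pages;            /* size of the task's slot */
        unsigned long *physical; /* physical page backing each slot page */
        unsigned char *touched;  /* pages faulted in since switch-in */
    } task_t;

    /* Called from the abort handler on first access to a page */
    void task_page_in(task_t *task, unsigned long slot_base, size_t page)
    {
        pagetable_map(slot_base + page * PAGE_SIZE, task->physical[page]);
        task->touched[page] = 1;
    }

    /* Only the pages actually touched need their entries undone, so a
     * large slot that was barely used is cheap to switch away from */
    void task_switch_out(task_t *task, unsigned long slot_base)
    {
        for (size_t i = 0; i < task->pages; i++) {
            if (task->touched[i]) {
                pagetable_set_aborting(slot_base + i * PAGE_SIZE);
                task->touched[i] = 0;
            }
        }
    }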

This improved the efficiency of large application swaps at the expense of more page table changes on access - if tasks were just being switched in and out rapidly, with very little memory access in between, this would be significantly faster. However, there were known problems with this.

ARM 6 and ARM 7 had different abort handling, which meant that fixing up failures on these systems was significantly harder - and had not been done. Prior to StrongARM revision T, a bug in the processor meant that lazy task swapping couldn't be done reliably. On these processors, the LDMIB (load multiple, increment before) instruction did not handle aborts properly under particular constraints on the registers being loaded. This meant that such instructions couldn't be reliably replayed after the abort - and that prevented Lazy Task Swapping from being enabled.

During our testing we tested revision S StrongARMs to check the behaviour, but never found any problems with the applications we were using when the feature was forced on. In any case, we left the feature as it was, because the erratum was very clear that it could be a problem - the fact that we didn't see any problems didn't really mean that it wouldn't be a problem.

The abort handler for the application space is very short in comparison to the 'full' trapping which is present for other operations. Early in the abort handler, the address being paged in would be filtered off and passed to the handlers for fixing up. Once fixed up - the page tables rewritten to their correct values - the system would return to the failing instruction to be executed again.

However, this caused a problem in some circumstances. As discussed previously, the memory handling in RISC OS dates back a long way, and hadn't really been updated to take account of the new Lazy Task Swapping abort handling. If a page was protected (for example, made read-only in USR mode) using the defined interface (SWI OS_SetMemMapEntries) and Lazy Task Swapping was enabled, the abort handler was oblivious to this. If you wrote to the read-only page, the system would see an abort. The abort handler would ensure that the page entry was the one that it expected (which it already was), and would then re-execute the instruction. Which would then abort in exactly the same way.
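
In sketch form, the handler's logic, and the check it was missing, might look like this (illustrative C; the real handler is a short piece of assembler):

    extern int  page_entry_is_current(unsigned long logical);
    extern void page_entry_fix_up(unsigned long logical);
    extern void full_abort_handler(unsigned long logical);

    void lazy_abort_handler(unsigned long fault_address)
    {
        if (page_entry_is_current(fault_address)) {
            /* Nothing to fix up: the entry is already the one we would
             * write, so this is a genuine fault (such as a write to a
             * page made read-only with OS_SetMemMapEntries).  It must
             * be passed to the full handler, not blindly replayed -
             * replaying is exactly the infinite loop described above. */
            full_abort_handler(fault_address);
            return;
        }
        page_entry_fix_up(fault_address);
        /* returning restarts the faulting instruction, which now works */
    }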

A few applications and tools would try to protect the application space in this way, and this caused a problem. There was also an issue that these protections wouldn't actually be preserved some of the time. The calls that checked the page tables (SWI OS_ReadMemMapEntries, SWI OS_FindMemMapEntries and SWI OS_Memory) were unaware of the new state of the page tables, and would claim that there were no pages in use if they hadn't been paged in yet, rather than returning the pages that would be there when they were mapped in. Lazy Task Swapping should have been entirely transparent to these interfaces. I fixed a few of these issues whilst working on the abort trap handling and testing the behaviour of different mapping operations.

RISC OS 4 dynamic area flags

RISC OS 4 had introduced a few new concepts for Dynamic Area support - those of application bound, sparse, and shrinkable areas.

Application bound areas

The application bound areas were intended to be paged in and out with the application, but had not been implemented in RISC OS 4. I did not pursue them during the development of Select. They might have been followed up later, as more of the environment was rationalised, but for the extant versions it was a redundant flag.

Sparse areas

Sparse areas were intended to allow large dynamic areas to contain regions of unmapped memory. For example, the module area could have holes where the memory had been freed. Such use would reduce the actual memory usage of the system, although the memory fragmentation would remain. It had been planned that we could use the sparse areas for things like the RMA, but the gain would be small. In theory, cache areas could use this to reduce their memory usage. As discussed later, I wrote a few examples of code which created a virtual memory system in a Dynamic Area, but its use never progressed - in the OS - much further than this. The areas were tested and were very reliable, though.
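
From C, using a sparse area looks something like this. The OS_DynamicArea create call is the documented one, but I have assumed the sparse flag bit and the claim/release reason codes (9 and 10), so check them against the documentation before relying on this:

    #include "kernel.h"
    #include "swis.h"

    #define DAFlags_Sparse (1u << 10)  /* assumed bit position */

    int main(void)
    {
        int area, base;

        /* Create a sparse area: 16M of address space, nothing mapped */
        if (_swix(OS_DynamicArea, _INR(0,8) | _OUT(1) | _OUT(3),
                  0,              /* reason: create */
                  -1,             /* area number: allocate one for us */
                  0,              /* initial size: no pages mapped */
                  -1,             /* base address: allocate for us */
                  DAFlags_Sparse, /* other flag bits left clear here */
                  16 << 20,       /* maximum size */
                  0, 0,           /* no handler routine or workspace */
                  "Sparse example",
                  &area, &base) != NULL)
            return 1;

        /* Map 8K in the middle of the area (assumed reason 9: claim) */
        _swix(OS_DynamicArea, _INR(0,3), 9, area, base + (1 << 20), 8192);

        /* ...use the memory... then give it back, leaving a hole
         * (assumed reason 10: release) */
        _swix(OS_DynamicArea, _INR(0,3), 10, area, base + (1 << 20), 8192);

        _swix(OS_DynamicArea, _INR(0,1), 1, area);  /* remove the area */
        return 0;
    }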

Shrinkable areas

Shrinkable areas, on the other hand, were used in a few places. The principle with shrinkable areas was that often the data in the area didn't need to be there - either because it was a cache, or because the memory was just pre-allocated to save time later. The Kernel, when asked how much memory was free, could query these shrinkable areas to ask how much memory they might reclaim. Generally, handlers should return very quickly from these requests, as there could be a lot of calls. If the Kernel ran out of actual memory in its pool, it could request that areas shrink themselves by an amount. This allowed them to free up some space if their workspace wasn't actually required.
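
A handler for a shrinkable cache area has roughly this shape. The dynamic area handler is really entered with a register block (usually via a cmhg veneer), and the 'test shrink' conventions are assumed here, so treat this purely as the shape of the logic:

    typedef struct {
        unsigned cached_bytes; /* portion of the area that is only cache */
    } cache_state_t;

    /* 'Test shrink': report how much we could give up.  This must be
     * fast - the Kernel may ask on every free-space enquiry. */
    unsigned cache_test_shrink(cache_state_t *state)
    {
        return state->cached_bytes; /* all cached data is expendable */
    }

    /* 'Pre shrink': an actual shrink is coming, so discard enough
     * cached data to cover it before the pages disappear */
    void cache_pre_shrink(cache_state_t *state, unsigned bytes)
    {
        if (bytes > state->cached_bytes)
            bytes = state->cached_bytes;
        state->cached_bytes -= bytes;
        /* ...invalidate the corresponding cache entries here... */
    }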

The ConvertGIF module kept a cache of the decoded image data whilst it was being used, but allowed the area to be shrunk if memory was short. This meant that if the same image was plotted repeatedly it would be fast, but if we ran out of memory, that cache could be released. SpriteExtend did a similar thing for its JPEG decode cache and the JPEG transcode buffer.

Although shrinkable areas meant that working out the amount of free space in the system was slower - it was no longer a simple calculation of the remaining pages in the pool - it meant that memory could be used for data which could be expired, or was only retained in case it might be needed to speed up later operations. When memory was needed it could thus be made available, where previously such memory would have stayed locked up unless the user took explicit action to free it.

Select dynamic area features

Locked areas

Select 1 introduced lockable areas, although they were not used much initially. The idea of these areas was that once locked, using a code provided by the locker, the area could not be manipulated - resized or deleted - without first being unlocked. Mainly this prevented the 'accidental' resizing or deletion of the wrong area. For example, resizing an area owned by another module - or deleting the area - might cause that module to die rather badly. Preventing these sorts of unintentional changes was the intention of the locked areas.
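
I don't have the exact Select interface to hand, so the reason codes and registers below are purely hypothetical, but the model was along these lines - lock with a key, and present the same key to unlock:

    #include "kernel.h"
    #include "swis.h"

    /* HYPOTHETICAL reason numbers, purely for illustration */
    #define DAReason_Lock   12
    #define DAReason_Unlock 13

    /* Lock the area: resize and delete attempts now fail */
    _kernel_oserror *area_lock(int area, unsigned key)
    {
        return _swix(OS_DynamicArea, _INR(0,2), DAReason_Lock, area, key);
    }

    /* Only the holder of the key can make the area mutable again */
    _kernel_oserror *area_unlock(int area, unsigned key)
    {
        return _swix(OS_DynamicArea, _INR(0,2), DAReason_Unlock, area, key);
    }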

A few areas were locked in this way as the system was updated - particularly as areas moved out of the Kernel. Whereas some dynamic areas might be killed without any serious consequences (except, obviously, the modules that used them being upset), others were widely used, and changing them was likely to be fatal. For example, the VIDC memory area was initially prevented from being resized by a resize handler that rejected size changes. This was later changed to be a locked area, and the resize handler complexity could be removed.

Heap areas

The primary use for a Dynamic Area was to isolate memory allocations for a module. Prior to Dynamic Areas being commonly used, it would be normal to allocate module memory in the RMA. As all the memory allocations done in this way were mixed together, it was difficult to decide who each allocation belonged to (although see the Debugging ramble for a solution I came up with), and corruption would be fatal not just for the module that overran its buffers, but possibly for other components as well - which might be more dangerous.

Unless the Dynamic Area was being managed by some custom memory allocator, it was invariably managed using the SWI OS_Heap style calls. This handling code would be repeated in every module that needed to allocate memory in its own area. Before I joined RISCOS Ltd, I had a small template library that I copied between the modules which used heap allocations in different ways. This became a library that could be customised by a module-specific header file to handle different types of allocations and dynamic area sizes and permissions.
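
The pattern the library wrapped up is simple enough to sketch with the documented calls - create an area, initialise an OS_Heap heap inside it, and grow both when a claim fails. Error handling is abbreviated, and the area flags are left at zero for simplicity; real code sets the access bits it needs:

    #include "kernel.h"
    #include "swis.h"

    static int heap_area = -1;
    static void *heap_base;

    _kernel_oserror *module_heap_init(void)
    {
        _kernel_oserror *e;

        /* Create the area: one page now, up to 1M later */
        e = _swix(OS_DynamicArea, _INR(0,8) | _OUT(1) | _OUT(3),
                  0, -1, 4096, -1, 0, 1 << 20, 0, 0, "Module heap",
                  &heap_area, &heap_base);
        if (e != NULL)
            return e;

        /* OS_Heap 0: initialise a heap filling the area */
        return _swix(OS_Heap, _INR(0,1) | _IN(3), 0, heap_base, 4096);
    }

    void *module_alloc(int size)
    {
        void *block;
        int grow;

        /* OS_Heap 2: claim a block */
        if (_swix(OS_Heap, _INR(0,1) | _IN(3) | _OUT(2),
                  2, heap_base, size, &block) == NULL)
            return block;

        /* Claim failed: grow the area and the heap, then retry once.
         * Round up to whole pages, plus one for the heap's overhead. */
        grow = (size + 8191) & ~4095;
        if (_swix(OS_ChangeDynamicArea, _INR(0,1), heap_area, grow) != NULL)
            return NULL;
        _swix(OS_Heap, _INR(0,1) | _IN(3), 5, heap_base, grow);

        if (_swix(OS_Heap, _INR(0,1) | _IN(3) | _OUT(2),
                  2, heap_base, size, &block) == NULL)
            return block;
        return NULL;
    }

    void module_free(void *block)
    {
        /* OS_Heap 3: release the block */
        _swix(OS_Heap, _INR(0,2), 3, heap_base, block);
    }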

But the fact that we were doing the same thing repeatedly indicated that there was a gap in the system's functionality. One way that I could have handled this was to have a separate module which provided the functionality of dynamic area heaps. This would make a lot of sense from the point of view of modularity, and the isolation of features. It is possible that if I did it now, that's the way that I would do it. However, at the time it seemed more logical to add a new dynamic area flag for 'Heap Dynamic Area', and a few new reason codes to manipulate blocks within the area.

I think in retrospect this was a good decision - the area ends up with an explicit flag that says what it is used for (which makes it easier for inspection purposes - mostly for the Diagnostic Dump and addr handling) and means that the operations benefit from the faster area lookups used by the system. Still, it could have been done more in line with the other changes.

Initially, the heap areas were just tied to a basic SWI OS_Heap which used the minimum space required by the heap, with a lower limit of 1 page. This was fine most of the time, and many modules use this interface - it is a lot easier than doing it yourself, and it reduces the need for any special handling in the module unless the blocks being allocated are of fixed size (where it would usually be easier to manage them yourself).

Later, the handling improved such that an empty Heap Dynamic Area would use no memory, as that was quite a common case when modules were not in use. A little instrumentation indicated that it was actually quite common for Heap Dynamic Areas to thrash their allocations, regularly claiming and freeing blocks in alternation - usually for sequences of allocations which were made repeatedly and then freed. This resulted in the area growing by a few pages, only to be shrunk back again shortly afterwards.

To improve the performance, a few pages' grace was allowed in the Heap Areas. In tests, the most useful trade off seemed to be 3 pages unused - this balanced the overhead of pages being repeatedly remapped against the wastage of (potentially) 12K per dynamic area sitting unused.
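
The decision logic amounts to a few lines - a sketch, with invented names, of the hysteresis:

    #define PAGE_SIZE   4096u
    #define GRACE_PAGES 3u          /* up to 12K left in hand per area */

    /* Given what the heap needs and what the area currently has, return
     * the area size (in bytes) to aim for; alternating claim/free
     * sequences now stay within the grace and stop thrashing the map */
    unsigned area_target_size(unsigned heap_needed, unsigned area_current)
    {
        unsigned needed_pages  = (heap_needed + PAGE_SIZE - 1) / PAGE_SIZE;
        unsigned current_pages = area_current / PAGE_SIZE;

        if (needed_pages > current_pages)
            return needed_pages * PAGE_SIZE;                 /* must grow */
        if (current_pages - needed_pages > GRACE_PAGES)
            return (needed_pages + GRACE_PAGES) * PAGE_SIZE; /* shrink, keep grace */
        return area_current;                                 /* leave it alone */
    }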