Memory regions

There were quite a few memory regions that were managed by the Kernel, but which had no representation in the Dynamic Areas list. These anonymous areas were either historic, or private to the Kernel (or both). In some cases TaskManager had intrinsic knowledge about them, and would account for them in its display. These had become fewer as the OS was developed, and by the time RISC OS 4 was released the number of special areas had decreased to more sensible levels.

Some areas were still merged together because that was how the Kernel had allocated them during system start up. With later Select versions, I moved these apart as much as I could, either by moving them to the modules which owned them (in line with the principles of Modularity laid out previously), or by defining them through the regular interfaces (rather than by directly populating the internal Dynamic Area list).

In the past it had been common to expose the address of memory mapped regions through the SWI OS_Memory 8 or SWI OS_ReadSysInfo 6 interfaces. For example, finding the address of the page tables was done through SWI OS_ReadSysInfo 6, 13 or 14, and there were similar calls for other areas. Whilst this filled a gap, in that these areas had not previously been exposed at all, it seemed rather tacky to me that they be somehow special enough to warrant their own interface.

Many of these areas had new Dynamic Areas allocated for them, which meant that they were both recognisable as part of the logical memory map by enumerating the areas, and needed no special interfaces as the address and size of the area could be obtained through the normal SWI OS_DynamicArea interfaces. The system dynamic areas have fixed area numbers, so it is always possible to find them. The various processor stacks and the ROM were made into areas, along with the page tables and other special use areas.

Some areas, though, were just removed from the Kernel. The system sprite area moved to be controlled by the SpriteUtils module, and was considered legacy in any case. The RAM disc was controlled by RAMFS. The Font area was controlled by the FontManager. The VRAM rescue area and the screen area were removed entirely. Actually the Screen Area was faked through the LegacyScreen module just so that anything that read the area would still work - usually very old games that would read the size of the area, or resize it to the size they wanted the screen memory to be.

Some of the areas moved around a little to make space for a larger application space, and some of the areas were able to be moved far higher in memory than they ever had been previously. However they weren't going to work if they had executable code in them on 26bit systems, so they needed to wait for the 32bit work to be completed before they could complete their migration northward.

Some areas were easy to reduce in size - the cursor area had moved out of the Kernel's control so the area it was in could be reduced in size. Similarly, as the System Variables had been moved out of the Kernel (to the SystemVars module) the system heap didn't really need to be as big as it was - it could be reduced in size, freeing up space for other things.

OS_ValidateAddress

With new memory regions existing through the Dynamic Area interface, it should seem reasonable that you could find the existence of memory at an address by using SWI OS_ValidateAddress. However, this call is limited - it actually only says that the area is available. It doesn't tell you that you will be able to access that memory - for example it could be abortable, it could be inaccessible in USR mode, it could be read only (the SWI only tells you that it exists, not that it's writable).

For the BTS (Back Trace Structure) debugging system, the DiagnosticDump module needed to know what memory it could actually access - in many cases it would try to determine what regions were suitable for dumping to the disc, and the SWI OS_ValidateAddress call was not going to be useful. In particular, physically mapped areas should never just be blindly read - even the act of reading them might trigger destructive operations. There was a possibility that some areas weren't meant to be read - we might want to only record areas that are accessible in USR mode.

A new SWI call (SWI OS_Memory 24) was created to provide some of this information. It could say whether the memory region requested was readable, writable, physical or aborting, and whether the whole or only part of the region met those conditions. Having done this, the SWI OS_ValidateAddress call was made to just invoke the new SWI OS_Memory call.
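As a rough illustration of the 'whole or only part of the region' idea, the sketch below models memory as a table of mapped ranges and reports, for a requested region, whether each property holds for all, some or none of it. The memory map, the property names and the returned dictionary are all invented for the example - this is not the real SWI OS_Memory 24 register interface, and treating unmapped space as 'aborting' is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    start: int       # first address in the range
    end: int         # one past the last address (exclusive)
    readable: bool
    writable: bool
    physical: bool


# Hypothetical, non-overlapping memory map used only for this sketch.
MEMORY_MAP = [
    Mapping(0x0000, 0x8000, readable=True, writable=False, physical=False),
    Mapping(0x8000, 0x20000, readable=True, writable=True, physical=False),
    Mapping(0x3000000, 0x3010000, readable=True, writable=True, physical=True),
]


def grade(matching, total):
    """Turn a byte count into 'all', 'partial' or 'none'."""
    if matching == total:
        return "all"
    return "partial" if matching else "none"


def summarise(start, end, memory_map=MEMORY_MAP):
    """Describe the region [start, end) property by property."""
    if end <= start:
        raise ValueError("region must not be empty")
    total = end - start
    covered = {"readable": 0, "writable": 0, "physical": 0}
    mapped = 0
    for m in memory_map:
        # Number of bytes of this mapping that fall inside the region.
        overlap = min(end, m.end) - max(start, m.start)
        if overlap <= 0:
            continue
        mapped += overlap
        for prop in covered:
            if getattr(m, prop):
                covered[prop] += overlap
    result = {prop: grade(size, total) for prop, size in covered.items()}
    # Assumption: anything with no mapping behind it would abort on access.
    result["aborting"] = grade(total - mapped, total)
    return result
```

A caller such as a memory dumper could then skip anything that is not wholly readable, or that is physical at all, rather than relying on the mere existence of the memory.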

Protection of memory

Various areas of memory had different levels of protection in Select 4 compared to earlier versions. Some areas had already been protected as they had no right to be written to by code running in USR mode. For example, the System Variables area does not need to be written to from USR mode, so is read only there - it does need to be readable, though, because the names of variables are returned as part of the enumeration when SWI OS_ReadVarVal is called.

SVC stack

Many other areas were treated similarly, depending on whether they needed to be accessed from USR mode or not. The SVC stack, however, was given a very slightly different treatment. The base of the stack contains important information that modules use to find data - the C library uses it to find the private workspace for the module (in lieu of using a separate static base register), and the BTS structures are also stored in the same area. Corrupting these would mean that both the current module would be unable to access its data, and any backtrace information would be broken - probably causing the backtrace itself to fail.

The abort handler was updated to check whether the SVC stack pointer was wild (pointing to areas of memory that it should never be in), or in the low regions of the SVC stack. This was then converted to a special error which was reported - and the backtrace system should capture the stack and be able to report the failure usefully. To prevent the backtrace structures themselves being corrupted, the 2nd kilobyte above the stack base was flagged as read only in all modes (except on ARM6 systems, which didn't support that). Unfortunately this was the finest granularity at which the page could be protected. To get finer granularity, I would have had to switch to using 'tiny' pages, which were not supported by the Operating System, nor by any of the processors we worked with (if I remember correctly).
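The stack pointer check can be pictured with a small sketch. The stack base, stack size and the extent of the 'low region' below are made-up values chosen for illustration, not the Kernel's real constants; the only thing taken as given is that the stack descends towards its base, so a pointer close to the base means the stack is nearly used up.

```python
# Hypothetical layout, for illustration only.
SVC_STACK_BASE = 0x01C00000    # lowest address of the stack
SVC_STACK_SIZE = 0x2000        # assumed 8K stack
LOW_REGION_SIZE = 0x800        # assumed size of the protected area at the base


def classify_svc_sp(sp):
    """Classify an SVC stack pointer as 'wild', 'low' or 'ok'."""
    top = SVC_STACK_BASE + SVC_STACK_SIZE
    # A descending stack that is completely empty points at 'top',
    # so 'top' itself is a legitimate value.
    if sp < SVC_STACK_BASE or sp > top:
        return "wild"
    if sp < SVC_STACK_BASE + LOW_REGION_SIZE:
        return "low"
    return "ok"
```

An abort handler built this way would pick a distinct error for the 'wild' and 'low' cases before handing over to the backtrace code, instead of reporting a generic data abort.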

The SVC stack would, previously, be immediately followed by the system heap. This meant that reading off the top of the SVC stack (stack underflow) would still get valid data, and wouldn't necessarily crash. This didn't seem a problem, except that the Toolbox modules did something like this (by accident) when they were called. Again, it wouldn't be a problem except that on Pace's OS they had changed this so that there wasn't any memory above the SVC stack - and the Toolbox module crashed. Fixing the code was obviously vital as it was broken, but also it highlighted the issue that other things might do that.

A gap of a page was inserted between the SVC stack and System Heap to make such things safer - it wouldn't make a huge difference in practice, but might catch other similar cases in the future.

Zero page

Zero page, the addresses from 0 to &8000, was made read-only in user mode to prevent operations which might overwrite the important Kernel workspace. It had been an option to make the memory inaccessible in user mode, but this was impractical - too many known interfaces and exported parameters were referenced in the region.

Soon after the RISC OS 4 release (I think it was), we made some changes to the way in which the data in the first 1K of memory was managed. Normally, if you branch to address 0 you get a 'Branch through zero' error - in C terms, you have called a NULL function pointer. To trap this more directly I thought it would be useful to make the instruction at this address an undefined instruction, and then trap the case of an undefined instruction at address 0, converting it to the relevant error. The undefined instruction I chose, &E7FFFFFF, was a problem for some applications.

It is a problem because the value of the byte at address 0 is no longer 0. Why does that matter? Some applications rely on it. They assume that passing a pointer to 0 will mean the same as an empty string. Is that wrong? Of course, but to break them arbitrarily is unreasonable. Independent of my discovering this, I remember a frustrated email from Simon Birtwistle saying that we had broken the FontManager.

There is an odd bug (which I never fixed, because it is in the FontManager and I didn't want to go into those depths at the time) that it will read one character beyond the specified length of the string. Even if you specify the length of the string as 0, the byte at the string pointer will still be read. This even happens if you haven't got kerning enabled. I did look at it, but it very quickly became complex. In his case, he was calling the FontManager with a pointer to 0 and a length of 0, and where everything had been fine before, the SWI now returned an error about an invalid character.
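The effect can be reproduced in miniature. The sketch below lays out the first word of memory as four little-endian bytes and runs a deliberately buggy reader over it which, like the FontManager behaviour described above, always inspects the byte at the pointer even when the length is 0. The 'stand-in' word is purely hypothetical (only its zero low byte matters; the real instruction that lived at address 0 isn't recorded here), and the rule for what counts as an invalid character is invented for the example.

```python
import struct

UNDEFINED_INSTRUCTION = 0xE7FFFFFF   # the value chosen for address 0
STAND_IN_ORIGINAL = 0xE1A00000       # hypothetical word with a zero low byte


def memory_with_word_at_zero(word):
    """Build a tiny little-endian memory image with 'word' at address 0."""
    return struct.pack("<I", word) + bytes(12)


def check_string(memory, pointer, length):
    """Buggy reader: looks at the first byte regardless of 'length'."""
    first = memory[pointer]          # read even when length == 0
    if first == 0 or 32 <= first <= 126:
        return "ok"
    return "invalid character"


before = check_string(memory_with_word_at_zero(STAND_IN_ORIGINAL), 0, 0)
after = check_string(memory_with_word_at_zero(UNDEFINED_INSTRUCTION), 0, 0)
print(before, "->", after)
```

With the undefined instruction in place the byte at address 0 becomes &FF, so a caller that had been getting away with 'pointer to 0, length 0' suddenly fails.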

Restoring the original code at address 0, and so removing the undefined instruction change, made things work again. The undefined instruction method was better in some respects because it meant that if called in user mode you could actually store all the registers safely - otherwise it was difficult to preserve all the registers.

One of the more evil things that was done in some places was to use a STR r0,[r0,-r0] in order to preserve a register when you have no other registers known to you - it stores at word 0. The system would then put back the instruction that was at that address once the system had preserved its remaining state. This was excised from the Kernel code - I have a vague recollection of it being used as part of the Supervisor code at one point.

Error reporting

The reporting of errors differently gave me some ideas of how reports might be made more useful when something went wrong. Changing the 'stack overflow' messages for the SVC stack was one thing, but many of the other error messages had not been touched in a long time.

When RISC OS 3.1 was released, many of the system error messages were updated to be clearer. Previously, the error messages had mostly been carried forward from the BBC, or been just plain statements of problems. Many of the errors were made more wordy to be friendlier to users, and some were rewritten entirely. So there was some precedent for the change.

Partly, I wanted to make it clearer what the problem was when things went wrong. But this was also driven by the fact that the rearrangement of memory for the 32 bit system would mean that some applications might not work due to assumptions, and it would be useful to have immediate feedback on that. Of course, the DiagnosticDump module would help if the application was written in C, but there were still a lot of things written in assembler and BASIC which wouldn't be helped.

As an experiment I created a module which would augment the error handling for the aborting of certain areas. One of the enhancements to the SWI OS_AbortTrap handlers was the ability to return a hardware error to cause that error to be used in place of the standard abort message. The idea had been that accesses could be vetted so that (for example) a memory mapped file could report errors in attempts to access beyond the mapped file's regions, but this could also be used to report other messages about the accesses.

The ReservedMemory module would register a number of SWI OS_AbortTrap handlers on regions of memory which were known to be bad to access. The module would try to determine what code performed the bad operation, and report the type of operation that was performed. If the code was C, it would try to locate the function signature of the accessing routine as well.

*BASIC
BASIC V version 1.36 © RISCOS Ltd

Starting with 2093308 bytes free

>PRINT ?&1f00000

Internal error: Abort on data transfer at &03AC7814 by module BASIC, offset &4810 (byte read of Old IRQ stack / SWI dispatcher)
>QUIT

Shows that the BASIC access to the memory that was no longer present for the SVC stack would be reported as within the BASIC module and that the access was a byte access.

*memoryi 1F88000+4
01F88000 : 
Internal error: Abort on data transfer at &038A6F48 by module Debugger, offset &1F44 (byte read of VIDC 1 screen)

Shows the error reported when the Debugger tries to read a word to disassemble the code at that location.

*testabort
PRESS A SPECIAL KEY FOR TEST CODE!
a= abort (C-style)
d= data abort 
D= data abort in old IRQ stack 
e= error 
p= prefecth abort 
u= undefined instruction 
f= free of already freed memory 
m= malloc block corruption 
A= assert error 
s= stack overflow 
z= branch through zero 
ESC= just raise escape 
Others are ignored...

    8000 : .... : E1A00000 : MOV     r0,r0
D
Internal error: Abort on data transfer at &0000859C, routine dis_numberval (word write to Old IRQ stack / SWI dispatcher)

Postmortem requested

Processor registers
  a1/r0 =0x01f00000, a2/r1 =0x00000000, a3/r2 =0x00000009, a4/r3 =0x80808080
  v1/r4 =0x00000010, v2/r5 =0x00009400, v3/r6 =0x0000a418, v4/r7 =0x0000000c
  v5/r8 =0x00000004, v6/r9 =0x0000000c, sl/r10=0x0000b62c, fp/r11=0x0000c2ec
  ip/r12=0x0000c2cc, sp/r13=0x0000c2d0, lr/r14=0x00000000, pc/r15=0x4000859c

Stack backtrace
  Arg1: 0xe24cc010  -498286576
859c in function dis_numberval
  Arg1: 0x0000c358       50008 -> [0x60008144 0xffffffff 0x2385c688 0x00000000]
88d0 in function dis

The 'testabort' program was a mangled version of my pre-University work on an ARM disassembler - which just happened to be handy to stick a block of code in that could generate a number of different types of aborts. It was used initially to test the handling in the SharedCLibrary of different error conditions, and later to check that the DiagnosticDump module performed correctly in each of the cases. There are other cases that can be tested for the failing SharedCLibrary operations, but they are a little more invasive and tend to need more manual triggering.

I had added a new error trigger 'D' for writing to the old IRQ space to check that the errors were reported in a C application. In the above example, it has correctly detected and reports that the error is in the 'dis_numberval' function. As an aside, the DiagnosticDump also correctly traps this (because it is just like any other error).
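The bracketed part of those messages boils down to a lookup from the faulting address to a named region, combined with the width and direction of the access. A minimal sketch of that step follows. The two region names and their start addresses come from the output above; the region end addresses, the table layout and the function name are assumptions made for the example rather than the module's actual data.

```python
# (start, end-exclusive, name). The end addresses are guesses for illustration.
RESERVED_REGIONS = [
    (0x01F00000, 0x01F02000, "Old IRQ stack / SWI dispatcher"),
    (0x01F88000, 0x02000000, "VIDC 1 screen"),
]

WIDTH_NAMES = {1: "byte", 4: "word"}


def describe_access(address, width, is_write):
    """Return text such as 'byte read of VIDC 1 screen', or None."""
    for start, end, name in RESERVED_REGIONS:
        if start <= address < end:
            kind = WIDTH_NAMES.get(width, f"{width}-byte")
            operation = "write to" if is_write else "read of"
            return f"{kind} {operation} {name}"
    # Not a region this table knows about: leave the standard message alone.
    return None


print(describe_access(0x01F00000, 1, False))
print(describe_access(0x01F00000, 4, True))
```

Returning nothing for unknown addresses matters here, because the augmented text is only meant to replace the standard abort message for regions that are known to be bad to access.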

The module itself was pretty reasonable, although there were still some bits to be finished, so that it could prevent itself causing more aborts. I had also wanted to trap accesses to other areas, for example a write to the ROM should report a message in a different form as well, despite not being a 'reserved' area. In theory it should be possible to report similar messages for any sort of abort - assuming that it hadn't been created manually through SWI OS_GenerateError. Whether that would be more useful is a question I hadn't resolved.

Part of the problem with extending the diagnostics in this way was that it produced more noise for the non-technical user. Whilst there was a lot of information that was now being collected to allow reports to be made usefully to application authors, it wasn't always obvious what was needed. The DiagnosticDump module would catch a lot of the C-based application and module problems, and produced a file that could be sent to the author and should give them good information on what happened.

The error logs caught many of the messages that could be reported - which would prevent a report of 'there was some error, but I don't remember the message'. These errors would be augmented with any of the messages that the ReservedMemory module provided, so at least that information might be captured for the cases where the DiagnosticDump module wasn't able to help (BASIC and assembler applications mostly).

Whether this was something that was useful to users was a different matter. Having the information in different places isn't as useful if the user has to collect it. Indeed, they may not care and being presented with a list of numbers and internal information for applications that died isn't something that I, as a user, want to see.

A possible way - and one that I was aiming towards - was to collect the data in the background, and offer the option to submit it to the relevant parties. Doing that collection, determining the 'relevant parties', offering the option to the user in a useful manner, and ensuring that the user's privacy was respected, were all to be done.

Obviously that is a whole area of work in its own right, but one that needed to be dealt with. I liked exposing the information about what went wrong, but I think that these were areas that would need to be configured. The DiagnosticDump module had its own configuration, so that the amount of data collected could be controlled, but this would need to be expanded.

Anyhow, the module itself worked and might have been finished along with parts of the 32bit or A9 work, but wasn't quite there yet.