Resilience

Some significant steps were taken to improve the resilience of the system during the Select development. There are a few main areas that cause system instability:

  • Corruption of zero page.
    Usually this was caused by one of:
    • Bad accesses by applications or modules.
    • Unsuitable access by components that believe they know about Kernel data structures.
    • Poor validation of data passed to the Kernel, or other modules.
  • Corruption of other system areas.
    Areas used by the system, such as the module area, system heap, or private dynamic areas. These were often easy to break through buffer overruns or incorrectly dereferenced data structures.
  • Ease and necessity of access to privileged mode.
    Coupled with corruption in the above areas, this caused a significant number of problems. In many cases it is necessary to change to a privileged mode in order to implement certain APIs, and to access hardware.
  • Poorly thought out APIs and implementations.
    Toolbox, in particular, was easy to get into a state where applications would lock the system in endless loops of errors.
  • Unsafe practices in modules and tools when running within preemptive environment.
    Essentially 'stuff which doesn't expect to be run within TaskWindow'.

Some significant work had been done to try to isolate privileged mode clients (modules) from their user mode clients (applications). In particular, the addition in RISC OS 4 of user mode read-only zero page made for a slightly safer system - overwriting the start of memory was no longer both permitted and fatal. Other areas of memory had their access restricted over time, usually retaining readability in user mode, but generally only being made accessible where necessary.

As the module area was a shared region regularly used by applications (and required, in some cases), moving allocations out of it meant that there was less likelihood of data corruption due to buffer overruns. Modules were updated to isolate their memory allocations in their own dynamic areas. This restricted the scope of problems to the module itself - if a component overran its buffer it would only corrupt its own data. From a debugging point of view this made things significantly easier, as corruption in a given dynamic area would probably mean that the one user of that area was at fault. Restricting the areas so that they were not writable in user mode helped prevent random writes to memory from affecting modules.
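
To make that concrete, creating such a private area from a module's initialisation looks something like this - a minimal sketch using the documented OS_DynamicArea 0 (Create) call via the C _swix veneer; the name, sizes and flags are illustrative, and error handling is elided:

    #include "kernel.h"
    #include "swis.h"

    /* Sketch: give a module its own dynamic area for workspace, instead of
       allocating from the shared module area.  An overrun then corrupts
       only this module's data, which is far easier to diagnose. */
    static int   area_number;
    static void *area_base;

    _kernel_oserror *module_create_workspace(void)
    {
        return _swix(OS_DynamicArea, _INR(0,8) | _OUT(1) | _OUT(3),
                     0,                   /* reason 0: create a new area   */
                     -1,                  /* area number: allocate for us  */
                     64 * 1024,           /* initial size (illustrative)   */
                     -1,                  /* base address: allocate for us */
                     1u << 7,             /* flags: not draggable by user  */
                     16 * 1024 * 1024,    /* maximum size (illustrative)   */
                     0, 0,                /* no area handler / workspace   */
                     "Example workspace", /* name shown in Task Manager    */
                     &area_number, &area_base);
    }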

The side effect of this was that any components which continued to use the module area and did overrun their space would most likely hit a piece of code, rather than data. The effect of an overrun was therefore more likely to be significant. I played with changing the module area to be read-only, which was not a good solution because many modules write to themselves. An idea which had been started with the first post-RISC OS 4 development was to create a read-only module area, where modules which were suitably flagged could live. This area would only be writable while the modules were being loaded, and only for the region that needed updating. The benefit would be that those modules should be safer - unable to be modified unintentionally, either by themselves or by others.

Setting aside another 16M for a code area in a 26bit system was not really reasonable, so initially the area was going to be small. The work was abandoned for a variety of reasons, but it still existed and would be resurrectable without too much effort. In a 32bit system this sort of usage would be significantly easier, as the available address space is larger.

Many of the APIs require that they be entered in privileged modes. In particular, many of the environment handlers are privileged mode entry points. This means that any failure in your application's handling of an error could leave the entire system dead - accidentally clearing your application space (with '*InitStore', say). Reducing the need for these to be called in privileged mode would make for a much safer environment.
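
For reference, this is roughly how an application installs such a handler today - a sketch using the documented OS_ChangeEnvironment call (handler 6 is the error handler; the entry point and buffer here are illustrative). It is precisely because the Kernel will later enter that address in a privileged mode that a bad registration is so dangerous:

    #include "kernel.h"
    #include "swis.h"

    #define ErrorHandler 6          /* OS_ChangeEnvironment handler number */

    extern void error_entry(void);  /* illustrative handler entry point */
    static char error_buffer[256];  /* buffer the Kernel fills with the error */

    /* Install an error handler, saving the old one so that it can be
       restored.  The Kernel will later enter error_entry on the
       application's behalf - a mistake there can take out far more than
       the application itself. */
    void install_error_handler(void)
    {
        int old_entry, old_r0, old_buffer;
        _swix(OS_ChangeEnvironment, _INR(0,3) | _OUTR(1,3),
              ErrorHandler, error_entry, 0, error_buffer,
              &old_entry, &old_r0, &old_buffer);
    }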

Similarly, protecting other APIs against the registration of entry points in pageable space (initially, application space) would help a little.

Revisiting some APIs and implementations which sat over applications, particularly the Toolbox ones, could have helped to reduce the issues with the system ending up in an infinite loop of errors. Usually these were filters on applications which expected to be called back on the next poll; if the application did something unexpected - like, say, crashing - the Toolbox might end up generating an error because the task it expected to handle was not there... and never recover. There are other related cases where the interfaces do not check their state as well as they might.

The running of programs within a TaskWindow can make things difficult to handle. The TaskWindow will preempt on a SWI OS_UpCall 6 (which certain socket operations may issue), when an input or output operation is performed which can block, or when the task exceeds its time slot while executing in user mode with a flat SVC stack. In particular, the use of output to trigger preemption means that in any code called from the TaskWindow application (that is, module code) a call which writes to the screen might be a switching point, and might therefore mean that the module can be reentered - or never entered again.
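
For illustration, this is how a program can deliberately give up its time slot inside a TaskWindow - a sketch assuming the documented UpCall 6 ('sleep until the pollword is non-zero') behaviour that TaskWindow traps:

    #include "kernel.h"
    #include "swis.h"

    /* Sketch: politely yield the rest of this TaskWindow time slot.
       TaskWindow traps UpCall 6 and sleeps the task until the pollword
       becomes non-zero; a pollword which is already non-zero just gives
       other tasks a chance to run.  Outside a TaskWindow the UpCall goes
       unclaimed and returns harmlessly. */
    void taskwindow_yield(void)
    {
        volatile unsigned int pollword = 1;    /* already satisfied */
        _swix(OS_UpCall, _INR(0,1), 6, &pollword);
    }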

At one point it was the case that *Commands within the TaskWindow could cause confusion, because the commands assumed that they could never be reentered. '*Help .' followed by closing the TaskWindow would cause problems, but wouldn't usually kill the machine. There had also been the 'MessageTrans bug', which was essentially caused by a MessageTrans descriptor being placed on the SVC stack within a TaskWindow, and the stack being switched out by an operation. As the TaskWindow was at the time chaining the MessageTrans descriptors, this resulted in a broken chain. Stewart Brodie (I believe) found that one, after it had caused a few years of frustration for people, myself included.
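
The shape of that bug is worth sketching. This is not the original code, just the unsafe pattern, assuming the documented MessageTrans calls (the filename is illustrative):

    #include "kernel.h"
    #include "swis.h"

    /* Sketch of the hazard: MessageTrans_OpenFile chains the descriptor
       into a list which persists until MessageTrans_CloseFile.  In module
       code run from a TaskWindow this descriptor is on the SVC stack, and
       if the stack is switched out before the file is closed, the chain
       is left pointing into invalid memory. */
    _kernel_oserror *lookup_token(const char *token, char *result, int size)
    {
        int desc[4];            /* BAD: descriptor on the (SVC) stack;   */
        _kernel_oserror *e;     /* it must live in stable workspace      */

        e = _swix(MessageTrans_OpenFile, _INR(0,2),
                  desc, "Resources:$.Resources.App.Messages", 0);
        if (e == NULL)
        {
            e = _swix(MessageTrans_Lookup, _INR(0,3),
                      desc, token, result, size);
            _swix(MessageTrans_CloseFile, _IN(0), desc);
        }
        return e;
    }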

Improving the behaviour of the TaskWindow's preemption would have made a significant difference to these problems. Much of the legacy API usage, and collusion, in the TaskWindow had been replaced whilst I was fixing bugs in the code, but there were still too many things that weren't right about it.

Ideally it should just be replaced, both from the resilience angle and from a flexibility point of view. 'Tasks' within RISC OS, up to version 3, were managed by the Wimp and the Kernel - the Wimp managing the memory, and the Kernel handling parts of the environment.

On RISC OS 3.5 these were rationalised a little, with the memory management moved into the Kernel as part of the 'Application Memory Block' ('AMB') work. This was later augmented with support for Lazy Task Switching.

Application space and environments

The changes to the management of application space helped, but they needed to go a lot further. The legacy which remains from Arthur still hurts today, and whilst work was moving towards addressing it, a solution was still a little way off. That said, each iteration of the system moved things in the right direction as components were rationalised, collusion removed and new APIs introduced.

A particular problem is that the interfaces that create an environment are entirely under the control of the application, but have privileged status. This means that a user mode application can take out the entire system. This is unacceptable, and was one of the major things I was working towards removing. The primary reason for having privileged mode access was that the environment handlers needed to be entered in those modes.

The exception handlers are all entered in SVC mode. This was originally required so that the application could provide the fix-up necessary to implement any handling of the exceptions itself. This is no longer required - applications should not be handling any exceptions beyond reporting on them, which happens through the error handler. The situation is complicated slightly by the fact that exceptions might occur in SVC mode, making the USR mode registers inaccessible at the point of exception (and thus requiring the application to be entered in SVC mode in order to recover the correct registers). Or the exception might occur in UND mode, handling an undefined instruction, which also has its own banked registers.

Usually, though, the application cares only about its own state (and in the case of an exception in SVC mode, the abort save area can provide more information - not that that is a good solution). So the application should only be handling exceptions through the error handler (or another handler which explicitly deals with the exceptions), and may want access to the register set in the mode of the failure, and in USR mode as well. This requires a change to all the application handling, but once we accept that the reason for the change is to allow better control and greater resilience in the future, it is not a hard call. A legacy interface could be retained, overlaying the original, and applications using it would simply not have the protection enjoyed by those using the newer interface.

The Break Point environment handler has no purpose in modern systems, and its being called in SVC mode again raises privileges unreasonably - applications should not care about breakpoints; only debuggers should. It is questionable whether the SWI OS_BreakPt interface has any useful purpose at all when there exists a dedicated BKPT instruction. Regardless, it is outside of the application's scope.

The escape handler is entered in SVC mode to indicate that an escape event has occurred, or that one has been cleared. Escape is just another signal, like many others that can be received by the executing program, and should not have special preference. The interface really needs to be rationalised, as it is a reflection of a BBC interface which has far less meaning in modern systems. This is also shown by the fact that TaskWindow was reworked to handle escapes in a manner closer to the command line behaviour. The meaning and operation of the escape key need to be rationalised, and whilst C and BASIC handle it differently (and the PRMs suggest that it can be handled in a number of ways), the latitude given in this regard has meant that it is not handled consistently.
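
The C side of that difference is at least well defined: the run time delivers the Escape condition to the program as SIGINT. A minimal sketch of a program catching it, assuming only ANSI C and that mapping:

    #include <signal.h>
    #include <stdio.h>

    /* Sketch: how a C program sees Escape.  The run time maps the Escape
       condition onto SIGINT, so the program never touches the escape
       handler itself. */
    static volatile sig_atomic_t escaped = 0;

    static void on_escape(int sig)
    {
        escaped = 1;
        signal(sig, on_escape);     /* re-arm: ANSI handlers may be one-shot */
    }

    int main(void)
    {
        signal(SIGINT, on_escape);
        while (!escaped)
            ;                       /* real code would be doing work here */
        puts("Escape seen");
        return 0;
    }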

The event handler can be entered in either SVC or IRQ mode - it is just a tail handler for the Event vector. Again, this was an Arthur 'feature' which allowed applications to provide a handler without hooking it through the vector chain. It is not appropriate for an application to do this in a multitasking environment. Events which it is watching for can occur outside the scope of the application's environment, which effectively means that the application misses them. You might argue that if you are writing a single tasking application you could use them safely, but that is an unsound assumption. There should be no time when you can be certain that you will remain the current application executing. If the OS is to move forward, this will need to be the case. Thus, they are useless.
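
For completeness, the supportable route is the vector chain itself: a module claims EventV and receives events regardless of which application is current. A sketch, assuming the standard OS_Claim/OS_Release interface (the veneer name is illustrative):

    #include "kernel.h"
    #include "swis.h"

    #define EventV 0x10             /* the Event vector */

    /* 'event_veneer' would be a small assembler veneer obeying the vector
       entry/exit conditions; the name is illustrative.  Enabling the
       specific events (OS_Byte 14) is omitted here. */
    extern void event_veneer(void);

    /* Sketch: a module claims the vector chain, so the claim survives
       application switches - unlike the application's event handler. */
    _kernel_oserror *watch_events(void *workspace)
    {
        return _swix(OS_Claim, _INR(0,2), EventV, event_veneer, workspace);
    }

    _kernel_oserror *unwatch_events(void *workspace)
    {
        return _swix(OS_Release, _INR(0,2), EventV, event_veneer, workspace);
    }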

The UpCall handler is the same (albeit entered in SVC mode and called by the tail handler for UpCallV), and the same rules apply.

The Unknown SWI handler has already been removed, as it is completely useless to the application.

The application space limit is... amusing. Its purpose is to provide an indication of the limit of the space that the application has to use, so that another application living in the space above it is not overwritten. This, too, is unacceptable. An application in application space should be the only application in that space. The 'solution' of moving the application up memory, and executing newly invoked applications in the space left - asking the new application not to overwrite the caller - filled a gap, but should be moved away from; with changes to the way that applications are handled, it could be.

The exit handler is also fun, in that it takes on the responsibility of restoring the state of the caller. So if it wants to, it can simply not return to the application that called it. Obviously the WindowManager uses this behaviour, but it is not a reasonable situation to be in, where the application can control the machine.

So, there are some handlers that are needed by the application, some that are needed by the debugger, and some that just should not exist. Replacements for the existing handlers, and new APIs, would need to be defined - for example, you cannot just say that a handler will be entered in USR mode instead of SVC mode, as this means both that the USR mode registers might be lost, and that the stack, previously in a guaranteed state, might no longer be in any useful state. These problems are not insurmountable; the code just needs to be redesigned to handle the cases better.

Over the Select releases, the environment handlers had all been collected together, and were referred to as a single entity within a structure, through macros. This meant that anywhere that accessed the environment within the Kernel was both easy to find and simple to change if it needed to handle different uses.

Another problem with the environment handlers is that changing them must be done atomically. This means disabling IRQs, installing the handlers, and re-enabling IRQs again. That is not particularly reasonable either, as you have to be really sure that all the operations you perform in between are safe - and, in any case, applications should not have any control over IRQs.
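
In practice, 'atomically' means something like the following sketch (using the documented OS_IntOff, OS_IntOn and OS_ChangeEnvironment calls) - and it is exactly this, a user mode application fiddling with interrupt state, which is objectionable:

    #include "kernel.h"
    #include "swis.h"

    /* Sketch: the only tool the application has for changing handlers
       'atomically' is to turn IRQs off around the sequence - a privilege
       no user mode application should really be exercising. */
    void swap_handler(int handler, void *entry, void *r0, void *buffer)
    {
        _swix(OS_IntOff, 0);                    /* disable IRQs...          */
        _swix(OS_ChangeEnvironment, _INR(0,3),  /* ...swap the handler...   */
              handler, entry, r0, buffer);
        /* ...any further handlers must be installed here... */
        _swix(OS_IntOn, 0);                     /* ...re-enable IRQs        */
    }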

The 'currently active object' has caused confusion for many authors trying to write module applications. Module applications themselves are really just a way to share code between applications. Whilst they should not be discouraged, there should be a better way to achieve that sharing.

That covers the environment handlers. The application space itself was mentioned above, and it goes hand in hand with the environment. An 'application' (an entity living and executing in user mode within the application space) has handlers for events that interest it (exceptions, errors, etc.), the space that it is executing in (the application block), and its communication with other applications (start thinking of signals and pipes - in a very simplified form).

The application may also have properties describing the way in which it expects to run. Think of a 26bit application; or one which expects a particular environment - no unaligned accesses, FP maths operations which work in a particular way, a different instruction set to the rest of the system, or even simple things like 'dereferencing 0 is safe'. These properties could be directly denied (for example, by a processor unable to support an instruction set, even through emulation), or could provide a transition layer so that what the application sees is what it expects, whilst the system runs whatever it likes. To do this kind of thing, though, it is a requirement that the application not have direct access to SVC mode - removing these interfaces from the environment goes a long way towards preventing that need. It is accepted that there are other ways in which SVC access could still be necessary, but it is important to move in the right direction before we solve all the problems.

In many respects the properties would define the 'personality' of the system, in a similar way to Linux. In the same way as Linux, it might also allow interfaces to change to reflect those that are expected by the application, over those which have been updated or deprecated.

The insistence of many authors on being able to write raw assembler applications has been something of a problem for such environments - especially when those applications fail to handle all the cases properly. Raw assembler applications should still be discouraged, but with the correct updates, such applications could be made to work within the new application system without ever seeing SVC mode.

In order that applications be able to be recognised as such, they need to be identifiable. This had been a requirement since the StrongARM, and it was more strongly enforced in Select 4 - something I took a lot of flak for, to the point of people saying it was the worst mistake made in the Operating System. As I have said a few times, I stand by the decision completely. It was just one of many steps towards the improvement of the application system (as described in the Environment ramble, and here), and should have made significant inroads into making the system more controlled.

Ok, to return to what I was saying... the environment handlers would have been changed to provide only the access that the application needed. There would be a legacy interface, but that would be just a proxy through the new interfaces. The environment handlers would be controlled atomically, and I wanted a new module to take over this job. As well as handling the environment handlers, the new module would associate them with an application space, which it would also control. Control over the memory would remain in the Kernel (initially - I hadn't decided on whether that would remain the case ultimately), but the act of paging tasks in and out would be given to the new module.

The WindowManager would request task switches of the module, and the switch would take place under its control. Task contexts could be nested, allowing an application to spawn another application in a different context. This would prevent data sharing between applications, but would increase the control that the system had - removing it from the application. I have implied it above, but the removal of control from the application is vital to these changes. I do not really care that some authors might feel that they need that control, unless they can give strong and overriding arguments that negate all the issues that this solves. It is obviously important to retain backward compatibility, but that cannot be 'at any price' - a system which remains stuck in an early '90s design philosophy is doomed. I wanted to try to at least move on to a late '90s philosophy <laugh>.
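
To give a flavour of what that might have looked like, here is a purely hypothetical sketch - none of these names or calls ever existed - of the kind of interface such a module might export:

    /* Purely hypothetical sketch: no such module or API ever existed.
       The point is the ownership - contexts are created, switched and
       destroyed by the system (the Wimp, TaskWindow), never by the
       application itself, which only ever runs in USR mode. */

    typedef struct context context_t;    /* opaque; owned by the module */

    /* Create a nested context with its own environment handlers and,
       optionally, its own application space. */
    context_t *taskcontext_create(context_t *parent, unsigned int flags);

    /* Page a context in; the WindowManager would request this on a task
       switch, rather than performing the switch itself. */
    void taskcontext_switch(context_t *context);

    /* Tear a context down, releasing its handlers and memory - making it
       safe to kill a nested application without killing the machine. */
    void taskcontext_destroy(context_t *context);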

Going back a little, this would mean that the TaskWindow would do a lot less in terms of application handling. Nested applications within the TaskWindow would become new contexts. Because applications would not be moving themselves up memory, restricting the space more and more as they invoked further instances, and would in any case only be running in USR mode, it would be easier for the TaskWindow to kill off the applications - you should not get into a situation where closing a TaskWindow kills the machine (I realise there are many reasons for that happening, but the primary one is poor exit handling of nested applications).

The environment would have a concept of input and output destinations. These map (approximately) to stdin and stdout in C applications (I had made no decision on stderr, because there is no direct equivalent in RISC OS terms). A pipe buffer could be implemented (no relation to PipeFS) which allows output to the channels to be buffered. These would be the primary resources for input and output, and would be able to be attached to applications - so the calling application could take the output from the one it invoked, properly piping data.

This is a more advanced feature, and it would have taken longer to orchestrate. Parts of the initial work on it were trialled with the SharedCLibrary, which took over parts of the standard RISC OS file redirection through its own buffered file access. Partly that work was to rationalise the redirection handling, but it was also the proving ground for changes to the redirection system. It was well in advance of any pipe environment changes, but needed to be done early because the rationalisation was already required - and it is a very complex system, so being familiar with the corner cases is vital.

Because the environment would be managed by a separate module which provides all of the control, it could easily be updated to provide the ability to preempt the task. This would be useful for the execution of multiple threads within the application (trivial to add if there is a module managing the environment), and would allow the explicit preemption of the application (initially for TaskWindow - but do not discount the possibility of Wimp2-style support).

With a single controlling point for applications, it becomes easier (though in no way trivial) to start introducing multi-processor support. I am going to cop out here, and say that I am not going to go into my thoughts on multi-processor support beyond this. I have thought about it, but the thoughts are all pretty hazy, and depend on the success of all of the stages up to here - it is difficult to plan for more so far in advance. That said, do remember that the rambles here are based on work that was planned when I stopped doing things in 2006 - who knows if any of this would have worked out by now.

The changes to the environment and application system could have been made to the SharedCLibrary initially, with its handling starting to use the new environment system for nested applications, and the new environment set up and tear down. As the RTSK in the SharedCLibrary is intended to take care of all of that work, and the rest of the application should not care (allowing for the existence of additional language description blocks), this should be pretty much transparent.

BASIC should be pretty trivial to update, because the code is (despite its age) relatively easy to follow, at least when it comes to the environment handlers. And (oh joy) it would give you the almost immediate ability to run *Commands and not have them destroy your running BASIC program or crash the system.

Parts of the system, like the Floating Point Emulator, might need some reworking in order to report/preserve their environment properly, and hardware Floating Point would need similar special treatment. But this is just part of the general 'preservation of state' that moves between the application spaces.

There is an argument that vector registrations within application space could be implicitly handled (think of SWI OS_DelinkApplication and SWI OS_RelinkApplication). However, this has the problem that it does not preserve the ordering of claims - which could be addressed, but might be difficult to ensure. More importantly, it goes against the purpose of the abstraction: that the application should not be encouraged to have any SVC entry points.
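
For reference, the existing pair works roughly as below - a sketch assuming the documented OS_DelinkApplication/OS_RelinkApplication interface; the buffer size is illustrative, and the real format and sizing rules are in the PRM:

    #include "kernel.h"
    #include "swis.h"

    /* Sketch: saving and restoring vector claims which point into
       application space around a switch-out.  The buffer holds the
       delinked registrations; 256 bytes here is illustrative. */
    static char vector_buffer[256];

    void application_switch_out(void)
    {
        _swix(OS_DelinkApplication, _INR(0,1),
              vector_buffer, sizeof(vector_buffer));
        /* ...application space can now be paged out safely... */
    }

    void application_switch_in(void)
    {
        _swix(OS_RelinkApplication, _IN(0), vector_buffer);
    }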

For later consideration there are issues like blocking of applications and application signalling (think of mutexes, and socket events or timers), but again, with a central point for these to be managed it becomes easier to do so. If correctly thought about, a Wimp message could just be an entry point within the environment, spawning new threads. I am not going to suggest that that is the 'right' thing to do, and I can see many, many reasons not to do that, but as a consideration it does make for an interesting train of thought - how many things would be broken by it, how many nice features could it allow, and so on.

In many respects this is a process model, as used by other systems. However, it differs significantly in being a development of the extant RISC OS interfaces, rather than an attempt to fit RISC OS into another system. It has been suggested many times that RISC OS should just change its kernel to that of another system. In general such suggestions show a naivety about the way in which RISC OS works, and the huge number of things that would need to be rewritten to make that work.

In parallel with this, I wanted to consider the possibility of stub environments. These might not exist as complete environments with application space, but would provide the ability to handle failures more robustly within non-application space. For example, a module's failure in SVC mode, or IRQ mode, would need to be handled differently. Work had already begun to separate these errors - they would be reported as 'background' errors to the foreground applications. Still, this was wrong. These sorts of exceptions are not the foreground application's fault, and it should not be penalised for them. The plan was to move the error handling away so that background errors were reported through a separate interface, and could even be automatically recovered from. That is not always going to be possible, but where problems were recognised, they should be able to be dealt with without killing off any applications.

Quite whether these modules would themselves become applications which simply have no application space when an interrupt occurred, or would be purely stub environments, was something I had not decided. However, mapping out the entirety of application space on an IRQ would be a trivial operation using the domain control. Obviously the remaining problems with domain aborts would need to be fixed, but most of these had been addressed as part of the abort trapping work. Certainly there was a need for more work in that area, and with more experimentation a balance could probably be found.