The Desktop: The buggy bits | RISC OS Rambles

WindowManager

UpCall holes

One of the more amusing problems with the Window Manager was found because of problems with applications using UpCall environment handlers. They're pretty rare to use, mostly because applications should work in a multitasking environment, and the UpCall handler might not be called if you weren't paged in. For example, using it to trap socket notifications would be futile in a Wimp or TaskWindow task.

However, if you did use them, you might find that your application crashed. The reason is a hole in the handling of the environment handler switching that the WindowManager performs. It doesn't change the handler back to the default during part of the task switch, so there is a window during which any events triggered through that handler could cause a crash.

I believe that UnixLib developers found the problem, and it was fixed with a small change to the Wimp, but it could have been triggered by many applications. It's pretty unlikely that you would encounter a case where you could usefully use an UpCall handler other than the default. Despite the UpCalls being intended for messages to the application, the fact that the system is invariably run with multiple tasks which aren't paged in and therefore do not have the environment handler to call, means that they are pretty useless. Unless you intend for your application to be the only one running, of course.

There are a couple of other places in the Wimp other than task switching where problems can occur with UpCalls, and they needed minor changes as well, to ensure that they wouldn't be affected in the same way. The Wimp is pretty good in its handling of task switches, though, so it was pretty easy to locate the problematic cases.

Unreliable Alt-break handling

There is another lovely race condition in the Kernel, which manifests itself through the Alt-break operation failing in a bad way (usually killing the machine, or random tasks). Alt-break usually displays an error box asking if you want to kill the current task, or offers the opportunity to kill another task, and will cycle through them. It is implemented through the reasonably obvious methods - trap the key press, trigger a transient callback to display the dialogue, and then either page in the task and kill it, or just kill it if the task is already paged in.

However, there are problems with non-transient callbacks. Non-transient callbacks can be requested by applications, and cause the system to return to the callback handler the next time that the system exits back to user mode. Usually these are used to perform task switches, for example in TaskWindow, but they can also be used for other purposes. The callback handler is called after any transient callbacks which have been set up. It should be obvious that there is a race condition there.

It is particularly bad in a heavily loaded system which is performing a lot of transient callbacks - as is the case for most Internet operations.

As well as correcting the 'return to user mode' handling, I remember adding some more checking and safety within the WindowManager to ensure that it shouldn't cause issues on other systems. It might not seem like the problem is that significant, but it was important enough to make 'Use Alt-break' the second in the list of 'Things Not To Do'.

Message registrations bug

Whilst trying to handle filtering of messages for applications, I found a rather strange bug that appeared to have been there since RISC OS 3. When an application starts up it registers itself with SWI Wimp_Initialise and on RISC OS 3 can supply a list of messages. If it doesn't supply a list of messages, it will receive no messages at all (assuming that it is a task which registered itself with a version number > 310). If it supplies a list that contains no entries, it will receive every message that is sent to it (either directly or broadcast). If it supplies a list of message numbers, it will only receive those messages.

It is somewhat strange and I am sure there was a good reason for this way of working. This filtering prevents tasks from handling other messages reliably. Normally, if you were to install a filter on an application, it would be because you wished to perform additional operations above it - usually involving new messages that the task did not know about. If the message you were interested in wasn't in the list, it wasn't coming in. So the filter would call the new SWI Wimp_AddMessages to inform the Wimp that it was actually interested in more messages than the application had originally said.

There is also a SWI Wimp_RemoveMessages to tidy up when the filter is done, but as the message registrations are not reference counted, this could be destructive. So, either the filter has to keep handling the operations safely after it has completed handling them, or assume that the application will ignore the messages which it didn't know about. In general, we assume the latter, although (as I have mentioned in an earlier ramble about problems with DeskLib) even this behaviour shouldn't really be relied on.

Anyhow, that's what you would usually do. There was no way to read the message list for the application, so it isn't possible to check what the current state of message handling is - and usually you don't care. However, in the rare case that the application wants to receive all messages the correct implementation is to pass an empty list. I say it is rare because just about every application from RISC OS 3 onwards has supplied a list of the messages it cares about. There are only a few rare occasions when you need to receive all the messages, and those are usually only for diagnostic purposes.

If you had an empty list, and you add messages to that list, it changes the behaviour in the wrong direction - from accepting all messages (empty list), it now accepts only a subset of messages (the messages that were added). It is an amusing case - and one that is explicitly coded around in the Toolbox filters. The 'fix' (if you can call it that) is to just discard the message list if we are already accepting all messages. It isn't that complicated, but it goes to show two things. Firstly, if you have an odd interface it will come back and bite you. Secondly, rare cases do occur and you do have to consider them - and curse that you didn't.
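The trap is easy to see in a toy model. The following is only a sketch in Python, not the Wimp's code: the class and method names are made up for illustration, the message numbers are arbitrary, and it assumes the representation described above, where an empty list stands for 'accept everything'.

```python
class TaskMessages:
    """Toy model of one task's message registrations (hypothetical names)."""

    def __init__(self, wanted):
        # Assumption taken from the text: an empty list means "accept all".
        self.wanted = list(wanted)

    def accepts(self, message):
        return not self.wanted or message in self.wanted

    def _append(self, extra):
        for message in extra:
            if message not in self.wanted:
                self.wanted.append(message)

    def add_messages_buggy(self, extra):
        # Blindly extends the list, so "accept all" silently turns into
        # "accept only the messages that were just added".
        self._append(extra)

    def add_messages_fixed(self, extra):
        # The fix: if we already accept everything, discard the new list.
        if not self.wanted:
            return
        self._append(extra)
```

With the buggy version, a task that started with an empty list stops seeing everything except the messages a filter added; with the fixed version it carries on accepting everything.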

I'm not sure whether I handled the list becoming empty due to calls to SWI Wimp_RemoveMessages, but that's a rare case (because of the problems mentioned about the messages not being reference counted) so it probably isn't a worry. And of course, by my second point above, I probably should have considered it even so.

Transient window crash

There was a quite fun crash that I found had been there a little while and could do very bad things to the system. I forget how I found it, but my recollection is that if you have the focus in a transient window and the application exits, and there are other windows owned by the application, the Wimp will crash.

The reason was something to do with registers being corrupted whilst the transient window was being destroyed, which meant that the following window destroy code went wrong. I'm pretty sure I had introduced the problem in an earlier version, but it was rather embarrassing to find that just quitting an application could make things go bang.

Fortunately, such crashes were less frequent as time went on. That usually means one of two things - either a) you've become used to the way that things work and unconsciously avoid the problematic cases, or b) you've fixed things that well. It's rarely the latter.

Shaded icons problems

There was a longstanding issue with moving the caret when it had been set to an icon which had been shaded. This didn't happen too often because usually if you could place the caret in a shaded icon in your code, you spotted that the machine crashed and so fixed it. However, it still wasn't great to have in the system.

The problem was simply that when the cursor keys are used to move up and down, or tab/shift-tab are used to move between the icons, the code would look for the next icon along (numerically) which is a member of the same ESG, which is writable and which isn't shaded. When it reaches the start (or the end, depending on the direction) of the list of icons, it moves to the opposite end. However, if there were no icons that could take the caret, the Wimp would keep trying, effectively hanging the machine.

The fix was simply to check if we had wrapped, but it was a little embarrassing that it was so easy (again) to kill the system. I may not have created the problem, but I always found these issues embarrassing.
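As an illustration of that check, here is a minimal sketch in Python. It is not the Wimp's implementation: the function name and the dictionary layout of the icons are invented for the example. Bounding the search to one pass over the icon list plays the part of 'have we wrapped?'.

```python
def next_caret_icon(icons, current, step):
    """Return the index of the next icon that can take the caret, or None.

    icons   -- list of dicts with 'esg', 'writable' and 'shaded' keys
               (a made-up layout for this sketch)
    current -- index of the icon that holds the caret now
    step    -- +1 to search forwards, -1 to search backwards
    """
    esg = icons[current]["esg"]
    index = current
    # Look at each icon at most once. Without this bound the search
    # spins forever when no icon is able to take the caret.
    for _ in range(len(icons)):
        index = (index + step) % len(icons)   # wrap to the opposite end
        icon = icons[index]
        if icon["esg"] == esg and icon["writable"] and not icon["shaded"]:
            return index
    return None   # we wrapped all the way round and found nothing
```

If the only usable icon is the one that already has the caret, the search comes back to it; if even that one is shaded, the caller gets None and can leave the caret alone.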

Returning shaded icons

For reasons I don't remember - and am not sure if I ever knew - Castle included a special flag in the Extended Window Flags which would change the behaviour of the SWI Wimp_GetPointerInfo call. When the flag was set, it made the SWI return shaded icons. I'm not sure why the extension wasn't made to the SWI directly, as that would be less invasive for filters and applications which didn't expect to get back such icons. Really, it's a downright stupid way of performing feature selection.

I can imagine an application checking whether a writable icon was under the pointer and if so setting the focus to it - the application wouldn't need to check whether the icon was shaded because the SWI wasn't expected to return them. And if, having set the focus to a shaded icon the user then tried to use the cursor keys on a system that hadn't had the bug fixed... well badness ensues.

In any case, in the interests of parity, the 'feature' was duplicated. It wasn't useful for many things, but I have a feeling that it meant that WindowScroll had to subsequently explicitly check whether the icon the pointer was over was shaded or not. I regret it a little, as I felt it to be a dangerous interface, but I was more concerned with ensuring that the features were complete. As it was commonly thrown at us that we didn't support features from their branch of the system, it was easier to go with it. Ah well, we're wrong now and then.

Window stack problems

Pinboard had gained the ability to toggle itself to the front of the screen, which was intended to make it a more useful accelerator for finding applications. A little while after this, there were reports of strange behaviour, where windows could end up being either lost 'behind' the Pinboard, or hidden windows could become visible again.

The Pinboard window is a 'back window' - it appears behind any other window. The windows have a number of flags, one of which indicates that the window is a 'back window'. It was always intended that such windows remain at the back of the stack, and fill the screen. Any window could be moved to the front of the stack, including a 'back window' - which would bring it to the actual front of the screen. This was something of a problem as the internal window search code used to determine visibility, and other stacking features, treated 'back windows' as special. Hiding windows would move them behind the 'back window', and if it wasn't at the back of the stack this could cause some odd effects.

With RISC OS 4, another flag had been introduced, to indicate that a window was a 'foreground window'. The intention, from the documentation supplied with RISC OS 4, had always been to have distinct stacks - the regular window stack, the 'foreground window' stack, and the 'background window' stack. That wasn't really the way that it was implemented, and in investigating it was clear that the 'foreground window' stack also had inconsistent behaviour. With this in mind, the code which managed the opening of windows was updated so that the order worked in a consistent way.

Opening a window at the 'front' (behind handle -1) would move it to the front of its stack. Opening a window at the 'back' (behind handle -2) would move it to the back of its stack. If you wanted to change the stack that the windows lived in, you needed to use the extended SWI Wimp_OpenWindow calls which allow the manipulation of the window flags.

The stack operations were all handled by a table which defines the operations that can be performed and what they mean, so it was pretty simple to extend. There had to be a special case for the iconised windows, because they needed to appear behind the magic internal back-window (the grey background that you see if there are no other windows to obscure it).
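The layered behaviour can be sketched roughly as follows. This is a toy model in Python with invented names, not the table-driven code in the Wimp, and it leaves out both opening behind a specific window and the special case for iconised windows.

```python
FRONT = -1   # 'behind' value: open at the front of the window's own stack
BACK = -2    # 'behind' value: open at the back of the window's own stack

# Front-most layer first. The layer names are labels for this sketch only.
LAYERS = ("foreground", "regular", "background")


class WindowStacks:
    def __init__(self):
        self.layers = {name: [] for name in LAYERS}   # each list is front first
        self.layer_of = {}

    def create(self, handle, layer="regular"):
        self.layer_of[handle] = layer

    def open(self, handle, behind=FRONT):
        # A window only ever moves within the stack it belongs to.
        stack = self.layers[self.layer_of[handle]]
        if handle in stack:
            stack.remove(handle)
        if behind == FRONT:
            stack.insert(0, handle)
        elif behind == BACK:
            stack.append(handle)
        else:
            raise ValueError("this sketch only models -1 and -2")

    def order(self):
        """Overall front-to-back order across all three stacks."""
        return [h for name in LAYERS for h in self.layers[name]]
```

In this model, bringing a 'back window' such as the Pinboard to the front only moves it to the front of the background stack, so it can never end up covering, or trapping, regular windows.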

The number of applications affected by this would have been tiny, and the new behaviour was significantly safer - you could no longer lose windows in the stack by performing quite normal operations.

It might not have been a big feature, but as 'foreground windows' had been part of the RISC OS 4 feature set which was really quite useful, there was a good likelihood that applications would use them. Users would be a little put out if their windows were lost just because of the feature!

Inefficient pause zones and error boxes

The pause zones are places where, if the mouse is in them, an automatic scroll will take place. They are usually found at the edges of windows during a drag. These regions are generally quite efficient, as they are handled entirely by the WindowManager. The IconBar had always scrolled horizontally, and had its own pause zone handling, which significantly predated that used by the drags.

If there were sufficient icons on the IconBar that it overflowed, it would be able to scroll left and right. Holding the mouse at either edge would cause it to scroll. My machine had a hardware problem (a common one, in my experience, and easy to hear once you plugged speakers in), the symptom of which was that you could 'hear' the system activity as interference through the speakers. I believe that there is some interference from the data bus, or processor, or whatever, that gets picked up by the sound system. Anyhow, when the machine was 'busy' there would be an obvious little white noise whine from the speakers.

This whine happened sometimes and wasn't too surprising as the machine was often busy. However, if the machine wasn't expected to be busy, the whine was an indication that something was amiss. The IconBar pause zone was one such occasion. Sitting the mouse at the right, or left, of the IconBar, even when there was nothing for the IconBar to do - it couldn't scroll - would cause the whine.

A little investigation showed this to be a failure to early-exit if there wasn't anything to do during the processing. Instead of exiting, the code would roll on, calculating the acceleration and the new position based on how long the pointer had been there, and then issuing a new request to reopen the IconBar in the new location. The new location would be ineffective because the window's scroll was already at its furthest extent. Removing the extra calculations removed the whine, and meant that the system would idle because it had no work to do.
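The shape of the fix is just an early exit before any of the arithmetic. The sketch below is in Python and is purely illustrative: the function name, the parameters and the acceleration curve are all made up, and returning None stands in for 'do not reopen the IconBar'.

```python
def iconbar_pause_step(scroll_x, max_scroll_x, direction, held_cs):
    """Return the new scroll offset, or None if there is nothing to do.

    scroll_x     -- current scroll offset, from 0 to max_scroll_x
    max_scroll_x -- furthest the bar can scroll (0 if it does not overflow)
    direction    -- -1 for the left pause zone, +1 for the right
    held_cs      -- how long the pointer has been in the zone (centiseconds)
    """
    # Early exit: the bar is already at its limit in this direction, so
    # skip the calculation and, more importantly, the reopen request.
    if direction < 0 and scroll_x <= 0:
        return None
    if direction > 0 and scroll_x >= max_scroll_x:
        return None

    # Invented acceleration curve: scroll faster the longer the pointer stays.
    speed = min(4 + held_cs // 10, 32)
    new_x = scroll_x + direction * speed
    return max(0, min(max_scroll_x, new_x))
```

A caller that gets None back has nothing to redraw and can let the system idle.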

There was a similar effect in the error box. When an error message was displayed on the screen, outside the desktop, the system busy-waits for a response. This is obviously a bad thing to do. During some of the work to reduce the places where the system was running around madly doing nothing, the error box was updated in a few ways - when the 'error box' was being shown at the command line (that is, a text prompt) the Wimp would trigger user mode callbacks. This meant that at the command line an error message that required a response would be able to let callback handlers get called. That might not seem too useful as we don't have preemption at the command line (errors in TaskWindows are strictly in the desktop, so appear as real interactive boxes), but the same mechanism (environment handler callbacks) is used to allow Alt-break to terminate a task.

In addition, the system would issue the SWI Portable_Idle call to put the system into idle state. Aside from stopping the whine (!) this was the main reason to address these areas. Any time the system could be known to be doing pointless work it was possible to remove the work and (maybe) free up battery life.

High priority pollwords

There was a very fun bug which was reported quite late on during Select development involving the High Priority Pollwords. These were special pollwords which were checked before any other operations, including the message dispatch. Because a register had not been updated correctly it was possible that the system could abort, whilst in a different task to the one that set the pollword.

Fortunately, very few applications (if any) used the feature, and it wasn't likely that it would be hit by people. But the bug meant that the use of the High Priority Pollwords was likely to be fatal. Which is quite amusing, really.

Date	Mon, 4 Feb 2013