
Multicore and Amiga: Present and Future

Symmetric Multi-Processing (SMP) has been on the wishlist of AmigaOS users for quite some time now, and while progress has been made, we’re still not there yet.

To explain the progress, let us first look at the concept, and then point out where we are in the whole process.

Concept: Threads, Cores and Processors

When looking at SMP support, we need to take actual processor technology into account. Older implementations used separate physical processors for SMP: one processor could execute one instruction stream, and to achieve parallel execution, you would plug in more processors. Not surprisingly, this approach is rather limited in the number of processors that can communicate with each other (although there were massively parallel machines that used a complex interconnection network for communication).

Later on, chip manufacturers added additional so-called “cores” to a single physical processor.

A very recent development is the ability of such individual “cores” to execute more than one instruction stream in parallel. We call those instruction streams threads. This technology is used in such CPUs as the Intel Core i7 (where it is named hyper-threading), or the Freescale e6500 core, which is used on the T-series CPUs from Freescale (up to the T4240, which has 12 physical cores with two threads each).

How to schedule tasks in SMP configurations

When looking at how to schedule Exec tasks and processes on an SMP system, we need to look at how much overhead is involved in scheduling. Currently, in single-CPU environments, Exec periodically interrupts execution and evaluates whether it should pick another task to run. Doing that at high frequency, say 20 or 30 times per second, creates the illusion of multiple tasks running in parallel.
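To make this concrete, here is a minimal sketch in C of one such scheduling decision. It is purely illustrative: the task list, the names and the tick rate are assumptions, not actual Exec internals.

```c
/* Minimal sketch of one priority-based scheduling decision, as it might
 * happen 20-30 times per second. Illustrative only, not Exec code. */
#include <stdio.h>

struct Task {
    const char *name;
    int         priority;  /* higher priority runs first    */
    int         ready;     /* 1 = ready to run, 0 = waiting */
};

static struct Task tasks[] = {
    { "input.device", 20, 1 },
    { "Workbench",     0, 1 },
    { "background",   -5, 0 },
};

/* Pick the highest-priority ready task; it gets the CPU until the next tick. */
static struct Task *scheduler_tick(void)
{
    struct Task *best = NULL;
    for (size_t i = 0; i < sizeof tasks / sizeof tasks[0]; i++)
        if (tasks[i].ready && (best == NULL || tasks[i].priority > best->priority))
            best = &tasks[i];
    return best;
}

int main(void)
{
    struct Task *next = scheduler_tick();
    printf("next task: %s\n", next ? next->name : "(idle)");
    return 0;
}
```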

The evaluation of whether a new task should run is based on the task’s priority, or on whether the task has something important to do. Of course, this takes time as well, and if this evaluation takes too long, the machine will spend more time deciding what to run than actually running it.

This gets worse as more execution units are added to the system: if the evaluation also had to ask other CPU cores whether they want to run something, the time for this processing would rise tremendously, to the point where adding more CPUs to the system would actually slow it down.

Therefore, the scheduling of Exec tasks and processes will remain something each CPU core does for itself, even in a multi-core system. This ensures that only a reasonable amount of time is spent on it.

So, how do multiple cores, threads and/or CPUs come into play?

We can easily monitor how loaded a core is by checking how many of the tasks that are ready actually get time to run. This is called the “load” of the CPU. If the core spends a lot of time waiting for tasks to become available, the load is low. If a lot of tasks are waiting for their turn to run, the load is high.

In a multi-core system, some cores will have a high load and others a low one. To balance this, the system will look at the load of the individual cores and determine whether the load needs to be balanced among the CPUs or not. This balancing is triggered by the individual schedulers when they notice that their current workload is too big for them to handle. In this case, the overloaded core will migrate some of its tasks to other cores until it can handle its workload again.
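A minimal sketch of this idea in C might look like the following. The load metric, the threshold and the “push” strategy are assumptions chosen for illustration, not the actual AmigaOS balancing code.

```c
/* Sketch of a per-core load check and a simple "push" migration:
 * an overloaded core hands tasks to the least-loaded core it can find.
 * All numbers and structures here are illustrative assumptions. */
#include <stdio.h>

struct CoreState {
    int ready_tasks;   /* tasks waiting in this core's ready queue */
};

/* Load is simply how many tasks are competing for this core. */
static int core_load(const struct CoreState *c) { return c->ready_tasks; }

static void balance(struct CoreState cores[], int ncores, int self, int threshold)
{
    while (core_load(&cores[self]) > threshold) {
        int target = self;
        for (int i = 0; i < ncores; i++)
            if (core_load(&cores[i]) < core_load(&cores[target]))
                target = i;
        if (target == self)
            break;                  /* nobody is less loaded: keep the task */
        cores[self].ready_tasks--;  /* "migrate" one task to the target     */
        cores[target].ready_tasks++;
    }
}

int main(void)
{
    struct CoreState cores[2] = { { 8 }, { 1 } };
    balance(cores, 2, 0, 4);        /* core 0 is overloaded, threshold 4 */
    printf("core0=%d core1=%d\n", cores[0].ready_tasks, cores[1].ready_tasks);
    return 0;
}
```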

Scheduling domains

In the previous section, we talked about balancing. Let’s take a closer look at this. A task running on one core is usually using the core’s resources. An important resource is the caches: a level 1 cache (L1) holds data from memory very close to the CPU’s instruction units, so accessing this data is nearly instantaneous. A level 2 cache (L2) sits below the level 1 cache, is usually bigger, and access to it is slightly slower than to the level 1 cache. Some systems even have higher-level caches below the L2 cache.

In a typical multi-core system, the L1 is exclusive to one core, while the L2 is shared among many cores. However, as in the scheduling example, more cores accessing one L2 means more communication overhead, because a core might have to wait for another core to finish accessing the L2. The more cores share the same L2, the more likely such a stall is. Therefore, some processors with a high number of cores have multiple L2 caches, with one group of cores sharing one L2 while another group shares another. We call those groups “clusters”. As an example, the T4240 has 12 cores, grouped into three clusters of four cores each, with each cluster sharing its own L2 and each core having a separate L1. In addition, there are cores that can run multiple threads at once. These threads even share the L1.

What we see here is a hierarchy of execution units. Clearly, moving a task from one thread to another thread on the same core is a rather cheap operation: the migrated task will not suffer from L1 misses, since it still uses the same cache. On the other hand, migrating from one core to another means that the new core does not have access to the L1 of the previous core, so the migration comes at a slight initial performance cost. Similarly, migrating to a new cluster means the task loses the benefit of both the L1 and the L2, resulting in an even larger initial performance cost.

As you can see, moving a task to another execution unit incurs different degrees of performance penalty. Note that these are only initial penalties: the caches will gather the necessary data over time, so that after a while they are fully available to the task again.

This hierarchy is in essence the hierarchy of “scheduling domains”. Scheduling domains define a cost associated with moving a task from one core to another. When balancing the load, the system will strive to minimize the cost of movement to ensure minimal performance loss. Migrating inside the current scheduling domain will always be cheaper than migrating to some core outside of it; however, if all cores in the task’s scheduling domain are overloaded, the higher cost will have to be paid.
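As an illustration, migration cost on a T4240-like topology could be modelled roughly as below. The actual kernel does not use these particular numbers; they only express the relative ordering described above.

```c
/* Rough model of scheduling-domain migration costs on a T4240-like layout:
 * threads on the same core share the L1, a cluster shares the L2.
 * The cost values are arbitrary; only their ordering matters. */
#include <stdio.h>

struct ExecUnit {
    int core;      /* physical core number           */
    int cluster;   /* L2 cluster the core belongs to */
};

static int migration_cost(struct ExecUnit from, struct ExecUnit to)
{
    if (from.core == to.core)       return 1;   /* other thread, same L1 */
    if (from.cluster == to.cluster) return 4;   /* same cluster, same L2 */
    return 16;                                  /* different cluster     */
}

int main(void)
{
    struct ExecUnit a = { 0, 0 }, b = { 1, 0 }, c = { 4, 1 };
    printf("same core:     %d\n", migration_cost(a, a));
    printf("same cluster:  %d\n", migration_cost(a, b));
    printf("other cluster: %d\n", migration_cost(a, c));
    return 0;
}
```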

Pitfalls: The dreaded Forbid

Adding this functionality to AmigaOS is not without problems, naturally. There is one central part of the OS that has been around since its very beginning, and has been widely misused and misunderstood: the Exec function Forbid (and its counterpart, Permit). The problem with these functions is both semantic and practical. Forbid has been documented both as disabling task switching and as forcing the system to become single-threaded.

On a single-core system, these two descriptions amount to the same thing, but in SMP they are something completely different. Forbidding task switching in an SMP system still leaves a number of threads running in parallel: no CPU core will switch tasks, since it is forbidden to do so, but the system is not running single-threaded at all.

This misunderstanding has led to a lot of misuse. Forbid has been used to protect critical data structures from being tampered with by other threads. Of course, this works in a single-core system: prohibit task switching, and you can be sure that no one else will get to access that data. In an SMP situation, however, this does not stop anyone from accessing it.

So what’s the solution to this? Keeping in mind that Forbid is mostly used to protect critical sections of code and data, the SMP-enabled kernel will treat it just like that: any core issuing a Forbid call will, simply put, write its number into a field somewhere in the system. It will do this “atomically”, meaning that the memory is only modified if no one else is competing for it. If it succeeds, it can proceed into the critical section. On the other hand, if there is already some other core in the critical section, or the write did not succeed, it will repeat this process until it succeeds, basically stalling the competing core until it is successful. This ensures that only one core at any time can enter into a Forbid state, and any subsequent core that wants to forbid will have to wait for its turn.

Of course, this is a simplified description of the process. Some other things are necessary to ensure fairness and equal distribution of access, to prevent one core from hogging this “lock” for too long, and so on, but the basic process works like this.
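For illustration, the core of such a lock could be sketched with C11 atomics as below. This is only a sketch of the idea described above, not the actual kernel implementation (which, as noted, also has to deal with nesting, fairness and interrupts).

```c
/* Simplified sketch of an SMP-aware Forbid: each core spins on an atomic
 * compare-and-swap until it owns the "forbidden" state. Illustrative only. */
#include <stdatomic.h>

#define NO_OWNER (-1)

static atomic_int forbid_owner = NO_OWNER;

void forbid_smp(int my_core)
{
    int expected = NO_OWNER;
    /* Retry until the owner field atomically changes from "no owner"
     * to our core number; competing cores spin here until it is free. */
    while (!atomic_compare_exchange_weak(&forbid_owner, &expected, my_core))
        expected = NO_OWNER;
}

void permit_smp(int my_core)
{
    (void)my_core;
    /* Leave the critical section so the next waiting core can enter. */
    atomic_store(&forbid_owner, NO_OWNER);
}

int main(void)
{
    forbid_smp(0);   /* core 0 enters the critical section       */
    /* ... access shared data structures here ...                */
    permit_smp(0);   /* core 0 leaves; other cores may now enter */
    return 0;
}
```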

Where are we now?

The development of SMP support has been separated into several distinct steps. The first step was to rewrite the scheduler in C for easier accessibility. In the very end, this step might be reversed again, rewriting the then SMP-capable scheduler back into assembly language. The second, more fundamental step was to decouple the scheduler from its current data structures. As you might know, ExecBase contains a number of lists for tasks that are ready, or waiting for a signal.

This has now been achieved. The current development build uses a scheduler that no longer uses the original AmigaOS data structures, but a structure that is replicated for each core.
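Conceptually, the per-core state might look something like the following sketch; the field and type names are assumptions for illustration, not the real kernel structures.

```c
/* Sketch of per-core scheduler state: instead of the single set of lists
 * in ExecBase, every core gets its own copy of the scheduling data, so the
 * normal scheduling path needs no cross-core locking. Illustrative only. */

struct MinList { void *head, *tail; };   /* stand-in for an Exec list */

struct CoreScheduler {
    int            core_id;
    struct MinList ready;     /* tasks ready to run on this core */
    struct MinList waiting;   /* tasks waiting for a signal      */
    unsigned long  load;      /* recent load, used for balancing */
};

#define MAX_CORES 12          /* e.g. a fully populated T4240 */

/* Each core only ever touches schedulers[its own core_id]. */
static struct CoreScheduler schedulers[MAX_CORES];
```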

The next step is to have each core in the development system (currently the X1000) run the scheduler. Test code will then start tasks on the different cores and see how they behave. We have already experimented with this and the results look promising. The tests basically showed that the lockout mechanism for Forbid works as planned.

As a final step, the balancing will be introduced, which then finalizes the first implementation of SMP support in AmigaOS.

Future plans

There are several possibilities to choose from once the first implementation is done. First of all, the scheduling algorithm is still the same as the one used by current Exec, a priority-modified round-robin. Naturally, this is an algorithm that can be improved upon. There are several other implementations that come to mind, like the O(1) scheduler, multilevel feedback queues or even the Brain Fuck Scheduler.

Also, the balancing algorithms are candidates for improvement. The system might record scheduling data and compute typical user profiles that can be pre-selected, determining, for example, when to balance and how aggressively balancing is carried out.

HDAudio driver is complete!

I am happy to announce the release of the finished HDAudio driver for the AmigaOne X1000!

The driver now supports recording as well as playback. It also now supports S/PDIF optical output.

There have been questions about whether full “32 bit” audio really makes a difference. I’d like to dig a little deeper to better understand the technical specifications.

There are two primary factors that contribute to the quality of a digital sound recording. One is resolution, or how many bits per sample, and the other is sample rate, commonly 44100 or 48000 samples per second.

As you look at the waveform of a sound recording, these two numbers determine the vertical and horizontal resolution of the wave.

I’ll begin with the “bit width” or vertical resolution.

The original Amiga’s sound output supported four channels at eight bits of resolution. Eight bits means there are two hundred and fifty six possible vertical “steps” that can be used as the wave is generated. Now we spread those steps across a -2 volt to +2 volt span and we get 0.015625 volts per step.

At the time of the Amiga’s introduction, that was pretty fair sound playback. But only 256 steps is not as “high fidelity” as we might like. As a comparison, Compact Disc audio is reproduced at 16 bits per sample. This makes for a big improvement in resolution. 16 bits offers us 65536 possible “steps” to spread across the -2 volt to +2 volt range. Now the step size is 0.0000610351562 volts per “step” of vertical resolution. So 16 bit audio is a HUGE increase in accuracy.

Getting back to our driver, AHIPrefs offers both 16 bit HiFi and 32 bit HiFi modes. But I’ll bet that neither of those modes gives exactly what you might expect. As AHI mixes lots of different sounds together, possibly each sound with its own volume and pan settings, it can be useful to have more resolution available to work with. Here’s the clue: ALL AHI modes that say “HiFi” are sending 32 bit data out to the sound device! The “16” and “32” only describe what goes IN to the AHI mix routines. If it says HiFi, you WILL get 32 bit output to your card!

Or will you? In truth, while AHI is making its calculations using 32 bit registers and 32 bit math, it only promises 24 bits of accuracy. Is this anything to be concerned about? Not at all. I’ll tell you why. 24 bit samples resolve to a “step size” of 0.0000002384185 volts per step. Wow! That is about one quarter of a microvolt. Those with an electronics background can probably tell you that attempts to accurately work at those levels are just ridiculous. We have reached an accuracy that is beyond the ability of our amplifiers and speakers to reproduce. Put simply, 24 bits is the reasonable limit of current technology, or at least affordable technology.
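If you want to check these step sizes yourself, a few lines of C reproduce the numbers above (using the same -2 V to +2 V span as in the examples):

```c
/* Reproduce the step sizes quoted above: a -2 V to +2 V span divided
 * into 2^bits steps for 8, 16 and 24 bit samples. */
#include <stdio.h>

int main(void)
{
    const double range_volts = 4.0;      /* -2 V to +2 V */
    const int depths[] = { 8, 16, 24 };

    for (int i = 0; i < 3; i++) {
        long steps = 1L << depths[i];    /* 256, 65536, 16777216 */
        printf("%2d bits: %8ld steps, %.10f volts per step\n",
               depths[i], steps, range_volts / (double)steps);
    }
    return 0;
}
```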

So our 32 bit samples are flying out of AHI and into the HDAudio codec. While the “container” is 32 bits wide, even the “high definition audio codec” that we have in the AmigaOne X1000 only resolves the top 24 bits. So it seems that in the end, both AHI and HDAudio agree that 24 bits is the reasonable limit for now.

And how about sample rate or the “horizontal” resolution?

How rapidly a sound is sampled and played back can also have a BIG impact on sound quality. It all starts with the Nyquist-Shannon sampling theorem, more commonly known as the Nyquist theorem. It’s pretty simple: as you record an audio signal, you must sample at a rate of at least twice the frequency of the highest pitch being recorded. Any sound that is higher than half the sampling frequency will be converted to noise, and nasty noise at that.

So how high do we need to go? It is generally held that the human hearing range is from 20 Hz (cycles per second) up to 20000 Hz. So any sample rate above 40000 Hz should be great, right? Well, yes and no.

One simple problem is that we still must filter out all sound above half the sample frequency, and most frequency-dependent volume controls (like graphic equalizers) work with gradual slopes. There is no “hard cutoff” at a certain frequency, so we need a bit of headroom.

But there is another reason. As a high-frequency sound approaches the Nyquist frequency (half the sample rate), we are only sampling about once per half-cycle. While this will reproduce the frequency of the original, it will do so with a bare minimum of accuracy. In other words, as frequencies get higher, they are reproduced with less detail.
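As a small illustration of the folding effect, this sketch computes the frequency you would actually hear when a pure tone is sampled at 44100 Hz. The helper function is hypothetical, just the standard fold-around-Nyquist formula.

```c
/* Illustration of aliasing: any tone above half the sample rate folds
 * back ("aliases") to a lower frequency. */
#include <stdio.h>
#include <math.h>

/* Frequency actually reproduced after sampling a pure tone at fs Hz. */
static double aliased_frequency(double f, double fs)
{
    double folded = fmod(f, fs);
    if (folded > fs / 2.0)
        folded = fs - folded;    /* reflect around the Nyquist frequency */
    return folded;
}

int main(void)
{
    double fs = 44100.0;         /* CD sample rate */
    printf("20000 Hz -> %.0f Hz\n", aliased_frequency(20000.0, fs)); /* kept      */
    printf("30000 Hz -> %.0f Hz\n", aliased_frequency(30000.0, fs)); /* now 14100 */
    return 0;
}
```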

So what does it really matter?
Audio CDs play back at 44100 Hz. Not bad at all.
Television/DVD audio is usually at 48000 Hz. Nice.
With the HDAudio chip in the X1000 we support both of those frequencies.
We also support 88200, 96000, 176400, and 192000.
So we can double or quadruple the sample rates of common media!

At first, I really thought it was all a numbers game, but while developing the driver, I could actually hear the noise decrease noticeably as the playback rates went up!

And that is where I’ll leave off. This was enough of a lesson for one day. I am very happy that I could contribute to the completion of this driver. And the chance to “raise the bar” regarding sound capability was really very nice icing on the cake.

Like many of us, I have been using Amigas for a long time. Today, right here in front of me is an Amiga that supports high definition audio and a modern high performance video card. It uses a standard, off the shelf keyboard, mouse and monitor, and many USB accessories as well. Most of these were unheard of in the classic days. But with all the new and shiny, it is still AmigaOS to the core.
🙂