Entries Tagged 'hardware' ↓

Amdahl's law in reverse: the wimpy core advantage

Once a chip’s single-core performance lags by more than a factor to two or so behind the higher end of current-generation commodity processors, making a business case for switching to the wimpy system becomes increasingly difficult.

– Google's Urs Hölzle, in "Brawny cores still beat wimpy cores, most of the time"

Google sure knows its own business, so they're probably right when they claim that they need high serial performance. However, different businesses are different, and there are advantages to "wimpy cores" beyond power savings which are omitted from the "brawny/wimpy" paper and which are worth mentioning.

It is a commonplace assumption that a single 3 GHz processor is always better than 4 processors at 750 MHz "because of Amdahl's law". Specifically:

  • Some of your tasks can be parallelized to run on 4 cores - these will take the same time to complete on the two systems.
  • Other tasks can only run on one core - these will take 4x more time to complete on the 750 MHz wimpy-core system.
  • Some tasks are in-between - say, a task using 2 cores will take 2x more time to complete on the 750 MHz system.

Overall, the 3 GHz system never completes later and sometimes completes much earlier.

However, this assumes that a 3 GHz processor is consistently 4x faster than a 750 MHz processor. This is true in those rare cherished moments when both processors are running at their peak throughput. It's not true if both are stuck waiting for a memory request they issued upon a cache miss. For years, memory latency has been lagging behind memory throughput, and the 4x faster processor does not get a 4x smaller memory latency.

Assuming the latency is the same on the two systems - which is often very close to the truth - and that half of the time of the 750 MHz system is spent waiting for memory when running a serial task, the 3 GHz CPU will give you a speed-up of less than 2x.

What if the task can run in parallel on 4 cores? Now the 750 MHz system gets a 4x speed-up and completes more than 2x faster than the 3 GHz system!

(This assumes that the 4 750 MHz cores are not slowed down due to them all accessing the same memory - which, again, is very close to the truth. Memory bandwidth is usually high enough to serve 4 streams of requests - it's latency which is the problem. So having several cores waiting in parallel is faster than having one core waiting serially.)

A slow, parallel system outperforming a fast, serial system - Amdahl's law in reverse!

Is memory latency unique?

In a way it isn't - there are many cases where faster systems fail to get to their peak throughput because of some latency or other. For instance, the cost of mispredicted branches will generally be higher on faster systems.

However, most other latencies of these kinds are much smaller - and are handled rather well by the mechanisms making "brawny" cores larger and deserving of their name. For example, hardware branch predictors and speculative execution can go a long way in reducing the ultimate cost of mispredicted branches.

"Brawny" speculative hardware is useful enough to deal with memory latency as well, of course. Today's high-end cores all have speculative memory prefetching. When you read a contiguous array of memory, the core speculates that you'll keep reading, fetches data ahead of the currently processed elements, and when time comes to process the next elements, they're already right there in the cache.

The problem is that this breaks down once you try to use RAM as RAM - a random access memory where you don't go from index 0 to N but actually "jump around" randomly. (Theoretically this isn't different from branch prediction; in practice, large, non-contiguous data structures not fitting into caches and "unpredictable" for prefetchers are much more common than unpredictable branches or overflowing the branch history table.)

Hence we have pledges like this computer architect's kind request to stop using linked lists:

Why would anyone use a linked-list instead of arrays? I argue that linked lists have become irrelevant in the context of a modern computer system:

1. They reduce the benefit of out-of-order execution.

2. They throw off hardware prefetching.

3. They reduce DRAM and TLB locality.

4. They cannot leverage SIMD.

5. They are harder to send to GPUs.

Note that problems 1, 2 and 4 "disappear" on wimpy cores - that is, such cores don't have out-of-order execution, hardware prefetchers or SIMD instructions.  So a linked list doesn't result in as many missed performance opportunities as it does on brawny cores.

"Why would anyone use linked lists"? It's not just lists, for starters - it's arrays of pointers as well. "Why array of pointers" seems exceedingly obvious - you want to keep references instead of values to save space as well as to update something and actually get the thing updated, not its copy. Also you can have elements of different types - very common with OO code keeping arrays of base class pointers.

And then using arrays and indexing into them instead of using pointers still leaves you with much of the "access randomness" problem. A graph edge can be a pointer or an index; and while mallocing individual nodes might result in somewhat poorer locality than keeping them in one big array, you still bump into problems 1, 2 and 4 for big enough node arrays - because your accesses won't go from 0 to N.

Of course you could argue that memory indirection is "poor design" in the modern world and that programmers should mangle their designs until they fit modern hardware's capabilities. But then much of the "brawny vs wimpy" paper's argument is that brawny cores mean smaller development costs - that you get high performance for little effort. That's no longer true once the advice becomes to fight every instance of memory indirection.

It's also somewhat ironic from a historical perspective, because in the first place, we got to where we are because of not wanting to change anything in our software, and still wishing to get performance gains. The whole point of brawny speculative hardware is, "your old code is fine! Just give me your old binaries and I'll magically speed them up with my latest and greatest chip!"

The upshot is that you can't have it both ways. Either keep speculating (what, not brawny/brainy enough to guess where the next pointer is going to point to?), or admit that speculation has reached its limits and you can no longer deliver on the promises of high serial performance with low development costs.

Formalizing "reverse Amdahl's law"

…is not something I'd want to do.

Amdahl's law has a formal version where you take numbers representing your workload and you get a number quantifying the benefits a parallel system might give you.

A similar formalization could be done for "reverse Amdahl's law", taking into account, not just limits to parallelization due to serial bottlenecks (what Amdahl's law does), but also "limits to serialization"/the advantage to parallelization due to various kinds of latency which is better tolerated by parallel systems.

But I think the point here is that simple formal models fail to capture important factors - not that they should be made more complex with more input numbers to be pulled out of, erm, let's call it thin air. You just can't sensibly model real programs and real processors with a reasonably small bunch of numbers.

Speaking of numbers: is time spent waiting actually that large?

I don't know; depends on the system. Are there theoretically awesome CPUs stuck waiting for various things? Absolutely; here's that article about 6% CPU utilization in data servers (it also says that this is viewed as normal by some - as long as RAM and networking resources are utilized. Of course, by itself the 6% figure doesn't mean that brawny cores aren't better - if those 6% are all short, serial sprints letting you complete requests quickly, then maybe you need brawny cores. The question is how particular tasks end up being executed and why.)

Could a "wimpy-core" system improve things for you? It depends. I'm just pointing out why it could.

From big.LITTLE to LITTLE.big, or something

There's a trend of putting a single wimpy core alongside several brawny ones, the wimpy core being responsible for tasks where energy consumption is important, and the brawny ones being used when speed is required.

An interesting alternative is to put one brawny core and many wimpy ones - the brawny core could run the serial parts and the wimpy cores could run the parallel parts. If the task is serial, then a brawny core can only be better - perhaps a lot better, perhaps a little better, but better. If the task is parallel, many wimpy cores might be better than few brawny ones.

(I think both use cases fall under ARM's creatively capitalized "big.LITTLE concept"; if the latter case doesn't, perhaps LITTLE.big could be a catchy name for this - or we could try other capitalization options.)

Hyperthreading/Barrel threading/SIMT

…is also something that helps hide latency using coarse-grain, thread-level parallelism; it doesn't have to be "whole cores".

Parallelizing imperative programs

…is/can be much easier and safer than is commonly believed. But that's a subject for a separate series of posts.

Summary

  • Faster processors are exactly as slow as slower processors when they're stalled - say, because of waiting for memory;
  • Many slow processors are actually faster than few fast ones when stalled, because they're waiting in parallel;
  • All this on top of area & power savings of many wimpy cores compared to few brawny ones.

Will OpenCL help displace GPGPU? Parallella, P2012, …

OpenCL is usually perceived as a C dialect for GPGPU programming - doing "general-purpose" computations (not necessarily graphics) on GPU hardware. "It's like Nvidia's CUDA C, but portable".

However, OpenCL the language is not really tied to the GPU architecture. That is, hardware could run OpenCL programs and have an architecture very different from a GPU, resulting in a very different performance profile.

OpenCL is possibly the first programming language promising portability across accelerators: "OpenCL is for accelerators what C is for CPUs". Portability is disruptive. When hardware vendor A displaces vendor B, portable software usually helps a great deal.

Will OpenCL - "the GPGPU language" - eventually help displace GPGPU, by facilitating "GP-something-else" - "general-purpose" accelerators which aren't like GPUs?

We'll discuss this question on general grounds, and consider two specific examples of recent OpenCL accelerators: Adapteva's Parallella and ST's P2012.

Why displace GPGPU?

First of all, whether GPGPU is likely to be displaced or not - what could "GP-something-else" possibly give us that GPGPU doesn't?

There are two directions from which benefits could come - you could call them two opposite directions:

  1. Let software (ab)use more types of special-purpose accelerators. GPGPU lets you utilize (abuse?) your GPU for general-purpose stuff. It could be nice to have "GPDSP" to utilize the DSPs in your phone, "GPISP" to utilize the ISP, "GPCVP" to utilize computer vision accelerators likely to emerge in the future, etc. From GPGPU to GP-everything.
  2. Give software accelerators which are more general-purpose to begin with. GPGPU means doing your general-purpose stuff under the constraints imposed by the GPU architecture. An OpenCL accelerator lifting some of these constraints could be very welcome.

Could OpenCL help us get benefits from any of the directions (1) and (2)?

(1) is about making use of anal-retentive, efficiency-obsessed, weird, incompatible hardware. It's rather hard, for OpenCL or for any other portable, reasonably "pretty" language.

OpenCL does provide constructs more or less directly mapping to some of the "ugly" features common to many accelerators, for example:

  • Explicitly addressed local memory (as opposed to cache)
  • DMA (bulk memory transfers)
  • Short vector data types to make use of SIMD opcodes
  • Light-weight threads and barriers

But even with GPUs, OpenCL can't target all of the GPU's resources. There's the subset of the GPU accessible to GPGPU programs - and then there are the more idiosyncratic and less flexible parts used for actual graphics processing.

With accelerators such as DSPs and ISPs, my guess is that today most of their value - acceleration ability - is in the idiosyncratic features that can't be made accessible to OpenCL programs. They could evolve, but it's a bit far-fetched and we won't dwell on it now. At their current state, OpenCL is too portable and too "pretty" to map to most accelerators.

What about direction (2)? (2) is about making something that's more efficient than CPUs, but as nice and flexible as possible, and more flexible than GPUs.

As a whole, (2) isn't easy, for various reasons we'll discuss. But if we look, in isolation, at OpenCL the language, then it looks like a great language for targeting "faster-than-CPUs-and-more-flexible-than-GPUs" kind of accelerator.

What could such an accelerator give us that GPUs don't?

One important feature is divergent flow. GPUs are SIMD or SIMT hardware; either way, they can't efficiently support something like this:

if(cond(i)) {
  out[i] = f(i);
}
else {
  out[i] = g(i);
}

What they'll end up doing is, essentially, compute f(i) and g(i) for all values of i, and then throw away some of the results. For deeply nested conditionals, the cost of wasted computations can make the entire exercise of porting to a GPU pointless.

We'll now have a look at two OpenCL-compatible accelerators which promise to efficiently support divergent threads - or outright independent threads doing something completely unrelated. We'll briefly compare them, and then discuss some of their common benefits as well as common obstacles to their adoption.

Adapteva's Parallella

Actually, the chip's name is Epiphany - Parallella is the recently publicized name of Adapteva's planned platform based on Epiphany; anyway.

Adapteva's architecture is a 2D grid of processors with a mesh interconnect. To scale, you can have a chip with more cores - or you can have a 2D grid of chips with some of the inter-core communication seamlessly crossing chip boundaries. Each of the (scalar) processors executes its own instruction stream - no "marching in lockstep", fittingly for divergent flow.

There are no caches; a memory address can map to your local memory, or the local memory of some other processor in the grid, or to external memory. Access latency will vary accordingly; access to local memories of close neighbors is quicker than access to far neighbors. All memory access can be done using either load/store instructions or DMA.

(Note that you can reach far neighbors - unlike some more "fundamentalist" proposals for "2D scalability" where you can only talk to immediate neighbors, period. I think that's over the top; if you want to run something other than the game of life, it's awfully handy to have long communication paths - as do most computers ranging from FPGAs to neurons, some of which have really long axons.)

Stats:

  • 32K memory per core (unified - both instructions and data)
  • 4 banks that can help avoid contentions between loads/stores, instruction fetching and DMA traffic
  • 2-issue cores (one floating point operation and one integer or load/store operation)
  • 800 MHz at 28nm using low-power process, as opposed to high speed (my bet that it's hard to top 800 MHz at 28nm LP - any evidence to the contrary?)
  • ~25mW per core, 2W peak power for a full chip with 64 cores
  • 0.128 mm^2 per core

Sources: BDTI's overview and various documentation from adapteva.com.

ST's Platform 2012

The P2012 architecture is also, at the top level, a grid of processors with a mesh interconnect. One stated motivation is the intra-die variability in future process nodes: some cores will come out slower than others, and some will be unusable.

It is thus claimed that a non-uniform architecture (like the ones we have today - CPUs and a boatload of different accelerators) will become a bad idea. If a core happens to come out badly, and it's not like your other cores, you have to throw away the entire chip. Whereas if cores are all alike, you leave the bad ones unused, and you may still have enough good ones to use the chip.

Interestingly, despite this stated motivation, the P2012 is less uniform and has higher granularity than Epiphany. Firstly, there's a provision for special-purpose accelerators in the grid. Secondly, the top-level mesh connects, not individual cores, but clusters of 16 rather tightly-coupled cores (each with its own flow of control, however - again, good for divergence).

Similarly to Epiphany, data is kept in explicitly addressed local memory rather than cache, and you can access data outside the cluster using load/store instructions or DMA, but you'll pay a price depending on the distance.

However, within a cluster, data access is uniform: the 16 cores share 256K of local data memory. This can be convenient for large working sets. Instructions are private to a core - but they're kept in a cache, not a local memory, conveniently for large programs.

Stats:

  • 32K per core (16 cores with 16K I-cache per core and 32 8K data memory banks)
  • 1MB L2 cache in a 4-cluster, "69-core" chip (presumably, (16+1)x4+1 - one extra core per cluster and per chip)
  • 2-issue cores (I failed to find which instructions can be issued in parallel)
  • 600 MHz at 28nm (process details unclear)
  • 2W for the 69-core chip
  • 0.217 mm^3 per core (3.7 mm^2 per (16+1)-core cluster), not accounting for L2 cache

Source: slides, slides, slides.

Parallela vs P2012: a quick comparison

Each of the architectures can have many different implementations and configurations. It seems fair to compare a 28nm 64-core Epiphany chip with a 28nm 69-core P2012 chip (or at least fair as far as these things go). Each system has its own incompatible native programming interface, but both can also be programmed in OpenCL.

Here's how Epiphany compares to P2012:

  • Power: 1x (2W)
  • Core issue width: 1x (2-issue)
  • Local memory size: 1x (32K per core)
  • Frequency: 1.33x (800/600)
  • Core area efficiency: 1.7x (0.217/0.128)

I think it's a fine achievement for Adapteva - a 5-people company (ST has about 50000 employees - of course not all of them work on  the P2012, but still; Chuck Moore's GreenArrays is 18 people - and he's considered the ultimate minimalist, and develops a much more minimalistic product which, for instance, certainly won't run OpenCL programs).

This is not to say that these numbers are sufficient to compare the architectures. For starters, we assume that the power is the same, but we can't know without benchmarking. Energy consumption varies widely across programs - low power process brings leakage down to about zero at room temperature, so you're left with switching power which depends on what code you run, and on what data (multiplying zeros costs almost nothing compared to multiplying noise, for instance).

Then there are programming model differences, ranging from the extent of compliance of floating point to the IEEE standard to the rather different memory models. In the memory department, the ability of P2012 cores to access larger working sets should somewhat negate Epiphany's raw performance advantage on some workloads (though Epiphany cores might have lower latency when accessing their own banks). But then two different 2-issue cores will generally perform differently - you need thorough benchmarking to compare.

So what are these numbers good for? Just for a very rough, ballpark estimation of the cost of this type of core. That is, a core which is flexible enough to run its own instruction stream - but "low-end" enough to burden the programmer with local memory management, and lacking much of the other amenities of full-blown CPUs (speculative prefetching, out-of-order execution, etc.)

Our two examples both point to the same order of magnitude of performance. Let's look at a third system, KALRAY's MPPA - looking more like P2012 than Epiphany, with 16-core clusters and cores sharing memory.

At 28nm, 256 cores are reported to typically consume 5W at 400 MHz. (Adapteva and ST claim to give worst case numbers). That's 20mW per core compared to Epiphany's 25mW - but Epiphany runs at 2x the frequency. If we normalized for frequency, Epiphany comes out 1.6x more power-efficient - and that's when we compare it's worst case power to MPPA's typical power.

MPPA doesn't support OpenCL at the moment, and I found few details about the architecture; our quick glance is only good to show that these "low-end multicore" machines have the same order of magnitude of efficiency.

So will OpenCL displace GPGPU?

The accelerators of the kind we discussed above - and they're accelerators, not CPUs, because they're horrible at running large programs as opposed to hand-optimized kernels - these accelerators have some nice properties:

  • You get scalar threads which can diverge freely and efficiently - this is a lot of extra flexibility compared to SIMT or SIMD GPUs.
  • For GPGPU workloads that don't need divergence, these accelerators probably aren't much worse than GPUs. You lose some power efficiency because of reading the same instructions from many program memories, but it should be way less than a 2x loss, I'd guess.
  • And there's a programming model ready for these accelerators - OpenCL. They can be programmed in other C dialects, but OpenCL is a widespread, standard one that can be used, and it lets you use features like explicitly managed local memory and DMA in a portable way.

From a programmer's perspective - bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs - and running existing GPGPU programs almost as well as GPUs?

There are several roadblocks, however. First of all, there's no killer app for this type of thing - by definition. That is, for any killer app, almost certainly a much more efficient accelerator can be built for that domain. Generic OpenCL accelerators are good at accelerating the widest range of things, but they don't excel at accelerating anything.

There is, of course, at least one thriving platform which is, according to the common perception, "good at everything but excels at nothing" - FPGA. (I think it's more complicated than that but I'll leave it for another time.)

FPGAs are great for small to medium scale product delivery. The volume is too small to afford your own chip - but there may be too many things to accelerate which are too different from what an existing chip is good at accelerating. Flexible OpenCL accelerator chips could rival FPGAs here.

What about integrating these accelerators into high-volume chips such as application processors so they could compete with GPUs? Without a killer app, there's a real estate problem. At 100-150 mm^2, today's application processors are already rather large. And the new OpenGL accelerators aren't exactly small - they're bigger than any domain-specific accelerator.

Few chips are likely to include a large accelerator "just in case", without a killer app. Area is considered to be getting increasingly cheap. But we're far from the point where it's "virtually free", and the trend might not continue forever: GlobalFoundries' 14nm is a "low-shrink" node. Today, area is not free.

Of course, a new OpenCL accelerator does give some immediate benefit and so it isn't a purely speculative investment. That's because you could speed up existing OpenCL applications. But for existing code which is careful to avoid divergence, the accelerator would be somewhat less efficient than a GPU, and it wouldn't do graphics nearly as good as the GPU - so it'd be a rather speculative addition indeed.

What would make one add hardware for speculative reasons? A long life cycle. If you believe that your chip will have to accelerate important stuff many years after it's designed, then you'll doubt your ability to predict exactly what this stuff is going to be, and you'll want the most general-purpose accelerator.

Conversely, if you make new chips all the time, quickly sell a load of them, and then move on to market your next design, then you're less inclined to speculate. Anything that doesn't result in a visibly better product today is not worth the cost.

So generic OpenCL accelerators have a better shot at domains with long life cycles, which seem to be a minority. And then even when you found a vendor with a focus on the long term, you have the problem of performance portability.

Let's say platform vendor A does add the new accelerator to their chip. Awesome - except you probably also want to support vendor B's chips, which don't have such accelerators. And so efficient divergence is of no use to you, because it's not portable. Unless vendor A accounts for a very large share of the market - or if it's a dedicated device and you write a dedicated program and you don't care about portability.

OpenCL programs are portable, but their performance is not portable. For instance, if you use vector data types and the target platform doesn't have SIMD, the code will be compiled to scalar instructions, and so on.

What this means in practice is that one or several OpenCL subsets will emerge, containing features that people count on to be supported well. For instance, a relatively good scenario is, there's a subset that GPU programmers use on all GPUs. A worse scenario is, there's the desktop GPU subset and the mobile GPU subset. A still worse scenario is, there's the NVIDIA subset, the AMD subset, the Imagination subset, etc.

It's an evolving type of thing that's never codified anywhere but has more power than the actual standard.

Standards tend to materialize partially. For example, the C standard supports garbage collection, but real C implementations usually don't, and many real C programs will not run correctly on a standard-compliant, garbage-collecting implementation. Someone knowing C would predict this outcome, and would not trust the standard to change it.

So with efficient divergence, the question is, will this feature make it into a widely used OpenCL subset, even though it's not a part of any such subset today. If it doesn't, widespread hardware is unlikely to support it.

Personally, I like accelerators with efficient divergence. I'll say it again:

"From a programmer's perspective - bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs - and running existing GPGPU programs almost as well as GPUs?"

From an evolutionary standpoint though, it's quite the uphill battle. The CPU + GPU combination is considered "good enough" very widely. It's not impossible to grow in the shadow of a "good enough", established competitor. x86 was good enough and ARM got big, gcc was good enough and LLVM got big, etc.

It's just hard, especially if you can't replace the competitor anywhere and you aren't a must-have. A CPU is a must-have and ARM replaces x86 where it wins. A compiler is a must-have and LLVM replaces gcc where it wins. An OpenCL accelerator with efficient divergence - or any other kind, really - is not a must-have and it will replace neither the CPU nor the GPU. So it's quite a challenge to convince someone to spend on it.

Conclusion

I doubt that general-purpose OpenCL accelerators will displace GPGPU, even though it could be a nice outcome. The accelerators probably will find their uses though. The following properties seem favorable to them (all or a subset may be present in a given domain):

  • Small to medium scale, similarly to FPGAs
  • Long life cycles encouraging speculative investment in hardware
  • Device-specific software that can live without performance portability

In other words, there can be "life between the CPU and the GPU", though not necessarily in the highest volume devices.

Good luck to Adapteva with their Kickstarter project - a computer with 64 independent OpenCL cores for $99.

"It's done in hardware so it's cheap"

It isn't.

This is one of these things that look very obvious to me, to the point where it seems not worth discussing. However, I've heard the idea that "hardware magically makes things cheap" from several PhDs over the years. So apparently, if you aren't into hardware, it's not obvious at all.

So why doesn't "hardware support" automatically translate to "low cost"/"efficiency"? The short answer is, hardware is an electric circuit and you can't do magic with that, there are rules. So what are the rules? We know that hardware support does help at times. When does it, and when doesn't it?

To see the limitations of hardware support, let's first look at what hardware can do to speed things up. Roughly, you can really only do two things:

  1. Specialization - save dispatching costs in speed and energy.
  2. Parallelization - save time, but not energy, by throwing more hardware at the job.

Let's briefly look at examples of these two speed-up methods - and then some examples where hardware support does nothing for you, because none of the two methods helps. We'll only consider run time and energy per operation, ignoring silicon area (considering a third variable just makes it too hairy).

I'll also discuss the difference between real costs of operations and the price you pay for these operations, and argue that in the long run, costs are more stable and more important than prices.

Specialization: cheaper dispatching

If you want to extract bits 3 to 7 of a 32-bit word and then multiply them by 13 - let's say an encryption algorithm requires this - you can have an instruction doing just that. That will be faster and use less energy than, say, using bitwise AND, shift & multiplication instructions.

Why - what costs were cut out? The costs of dispatching individual operations - circuitry controlling which operation is executed, where the inputs come from and where the outputs go.

Specialization can be taken to an extreme. For instance, if you want a piece of hardware doing nothing but JPEG decoding, you can bring dispatching costs close to zero by having a single "instruction" - "decode a JPEG image". Then you have no flexibility - and none of the "overhead" circuitry found in more flexible machines (memory for storing instructions, logic for decoding these instructions, multiplexers choosing the registers that inputs come from based on these instructions, etc.)

Before moving on, let's look a little closer at why we won here:

  • We got a speed-up because the operations were fast to begin with - so dispatching costs dominated. With specialization, we need 4 wires connected directly to bits 3 to 7 that have tiny physical delay - just the time it takes the signal to travel to a nearby multiplier-by-13. Without specialization, we'd use a shifter shifting by a configurable amount of bits - 3 in our case but not always - which is a bunch of gates introducing a much larger delay. On top of that, since we'd be using several such circuits communicating through registers (let's say we're on a RISC CPU), we'd have delays due to reading and writing registers, delays due to selecting registers from a large register file, etc. With all this taken out by having a specialized instruction, no wonder we're seeing a big speed-up.
  • Likewise, we'll see lower energy consumption because the operations didn't require a lot of energy to begin with. Roughly, most of the energy is consumed when a signal value changes from 1 to 0 or back. When we use general-purpose instructions, most of the gate inputs & outputs and most flip-flops changing their values are those implementing the dispatching. When we use a specialized instruction, most of the switching is gone.

This means that, unsurprisingly, there's a limit to efficiency - the fundamental cost of the operations we need to do, which can't be cut.

When the operations themselves are costly enough - for instance, memory access or floating point operations - then their cost dominates the cost of dispatching. So specialized instructions that cut dispatching costs will give us little or nothing.

Parallelization: throwing more hardware at the job

What to do when specialization doesn't help? We can simply have N processors instead of one. For the parts that can be parallelized, we'll cut the run time by N - but spend the same amount of energy. So things got faster but not necessarily cheaper. A fixed power budget limits parallelization - as does a fixed budget of, well, money (the price of a 1000-CPU rack is still not trivial today).

[Why have multicore chips if it saves no energy? Because a multicore chip is cheaper than many single core ones, and because, above a certain frequency, many low-frequency cores use less energy than few high-frequency ones.]

We can combine parallelization with specialization - in fact it's done very frequently. Actually a JPEG decoder mentioned above would do that - a lot of its specialized circuits would execute in parallel.

Another example is how SIMD or SIMT processors broadcast a single instruction to multiple execution units. This way, we get only a speed-up, but no energy savings at the execution unit level: instead of one floating point ALU, we now have 4, or 32, etc. We do, however, get energy savings at the dispatching level - we save on program memory and decoding logic. As always with specialization, we pay in flexibility - we can't have our ALUs do different things at the same time, as some programs might want to.

Why do we see more single-precision floating point SIMD than double-precision SIMD? Because the higher the raw cost of operations, the less we save by specialization, and SIMD is a sort of specialization. If we have to pay for double-precision ALUs, why not put each in a full-blown CPU core? That way, at least we get the most flexibility, which means more opportunities to actually use the hardware rather than keeping it idle.

(It's really more complicated than that because SIMD can actually be a more useful programming model than multiple threads or processes in some cases, but we won't dwell on that.)

What can't be done

Now that we know what can be done - and there really isn't anything else - we basically already also know what can't be done. Let's look at some examples.

Precision costs are forever

8-bit integers are fundamentally more efficient than 32-bit floating point, and no hardware support for any sort of floating point operations can change this.

For one thing, multiplier circuit size (and energy consumption) is roughly quadratic in the size of inputs. IEEE 32b floating point numbers have 23b mantissas, so multiplying them means a ~9x larger circuit than an 8×8-bit multiplier with the same throughput. Another cost, linear in size, is that you need more memory, flip-flops and wires to store and transfer a float than an int8.

(People are more often aware of this one because SIMD instruction sets usually have fixed-sized registers which can be used to keep, say, 4 floats or 16 uint8s. However, this makes people underestimate the overhead of floating point as 4x - when it's more like 9x if you look at multiplying mantissas, not to mention handling exponents. Even int16 is 4x more costly to multiply than int8, not 2x as the storage space difference makes one guess.)

We design our own chips, and occasionally people say that it'd be nice to have a chip with, say, 256 floating point ALUs. This sounds economically nonsensical - sure it's nice and it's also quite obvious, so if nobody makes such chips at a budget similar to ours, it must be impossible, so why ask?

But actually it's a rather sensible suggestion, in that you can make a chip with 256 ALUs that is more efficient than anything on the market for what you do, but not flexible enough to be marketed as a general-purpose computer. That's precisely what specialization does.

However, specialization only helps with operations which are cheap enough to begin with compared to the cost of dispatching. So this can work with low-precision ALUs, but not with high-precision ALUs. With high-precision ALUs, the raw cost of operations would exceed our power budget, even if dispatching costs were zero.

Memory indirection costs are forever

I mentioned this in my old needlessly combative write-up about "high-level CPUs". There's this idea that we can have a machine that makes "high-level languages" run fast, and that they're really only slow because we're running on "C machines" as opposed to Lisp machines/Ruby machines/etc.

Leaving aside the question of what "high-level language" means (I really don't find it obvious at all, but never mind), object-orientation and dynamic typing frequently result in indirection: pointers instead of values and pointers to pointers instead of pointers. Sometimes it's done for no apparent reason - for instance, Erlang strings that are kept as linked lists of ints. (Why do people even like linked lists as "the" data structure and head/tail recursion as "the" control structure? But I digress.)

This kind of thing can never be sped up by specialization, because memory access fundamentally takes quite a lot of time and energy, and when you do p->a, you need one such access, and when you do p->q->a, you need two, hence you'll spend twice the time. Having a single "LOAD_LOAD" instruction instead of two - LOAD followed by a LOAD - does nothing for you.

All you can do is parallelization - throw more hardware at the problem, N processors instead of one. You can, alternatively, combine parallelization with specialization, similarly to N-way floating point SIMD that's somewhat cheaper than having N full-blown processors. For example, you could have several load-store units and several cache banks and a multiple-issue processor. Than if you had to run p1->q1->a and somewhere near that, p2->q2->b, and the pointers point into different banks, some of the 4 LOADs would end up running in parallel, without having several processors.

But, similarly to low-precision math being cheaper whatever the merits of floating point SIMD, one memory access is always twice cheaper than two despite the merits of cache banking and multiple issue. Specifically, doubling the memory access throughput roughly doubles the energy cost. This can sometimes be better than simply using two processors, but it's a non-trivial cost and will always be.

A note about latency

We could discuss other examples but these two are among the most popular - floating point support is a favorite among math geeks, and memory indirection support is a favorite among language geeks. So we'll move on to a general conclusion - but first, we should mention the difference between latency costs and throughput costs.

In our two examples, we only discussed throughput costs. A floating point ALU with a given throughout uses more energy than an int8 ALU. Two memory banks with a given throughput use about twice the energy of a single memory bank with half the throughput. This, together with the relatively high costs of these operations compared to the costs of dispatching them, made us conclude that we have nothing to do.

In reality, the high latency of such heavyweight operations can be the bigger problem than our inability to increase their throughput without paying a high price in energy. For example, consider the instruction sequence:

c = FIRST(a,b)
e = SECOND(c,d)

If FIRST has a low latency, then we'll quickly proceed to SECOND. If FIRST has a high latency, then SECOND will have to wait for that amount of time, even if FIRST has excellent throughput. Say, if FIRST is a LOAD, being able to issue a LOAD every cycle doesn't help if SECOND depends on the result of that LOAD and the LOAD latency is 5 cycles.

A large part of computer architecture is various methods for dealing with these latencies - VLIW, out-of-order, barrel processors/SIMT, etc. These are all forms of parallelization - finding something to do in parallel with the high-latency instruction. A barrel processor helps when you have many threads. An out-of-order processor helps when you have nearby independent instructions in the same thread. And so on.

Just like having N processors, all these parallelization methods don't lower dispatching costs - in fact, they raise them (more registers, higher issue bandwidth, tricky dispatching logic, etc.) The processor doesn't become more energy efficient - you get more done per unit of time but not per unit of energy. A simple processor would be stuck at the FIRST instruction, while a more clever one would find something to do - and spend the energy to do it.

So latency is a very important problem with fundamentally heavyweight operations, and machinery for hiding this latency is extremely consequential for execution speed. But fighting latency using any of the available methods is just a special case of parallelization, and in this sense not fundamentally different from simply having many cores in terms of energy consumed.

The upshot is that parallelization, whether it's having many cores or having single-core latency-hiding circuitry, can help you with execution speed - throughput per cycle - but not with energy efficiency - throughput per watt.

The latency of heavyweight stuff is important and not hopeless; its throughput is important and hopeless.

Cost vs price

"But on my GPU, floating point operations are actually as fast as int8 operations! How about that?"

Well, a bus ticket can be cheaper than the price of getting to the same place in a taxi. The bus ticket will be cheaper even if you're the only passenger, in which case the real cost of getting from A to B in a bus is surely higher than the cost of getting from A to B in a taxi. Moreover, a bus might take you there more quickly if there are lanes reserved for buses that taxis are not allowed to use.

It's basically a cost vs price thing - math and physics vs economics and marketing. The fundamentals only say that a hardware vendor always can make int8 cheaper than float - but they can have good reasons not to. It's not that they made floats as cheap as int8 - actually, they made int8 as expensive as floats in terms of real costs.

Just like you going alone in a bus designed to carry dozens of people is an inefficient use of a bus, using float ALUs to process what could be int8 numbers is an inefficient use of float ALUs. Similarly, just like transport regulations can make lanes available for buses but not cars, an instruction set can make fetching a float easy but make fetching a single byte hard (no load byte/load byte with sign extension instructions). But cars could use those lanes - and loading bytes could be made easy.

As a passenger, of course you will use the bus and not the taxi, because economics and/or marketing and/or regulations made it the cheaper option in terms of price. Perhaps it's so because the bus is cheaper overall, with all the passengers it carries during rush hours. Perhaps it's so because the bus is a part of the contract with your employer - it's a bus carrying employees towards a nearby something. And perhaps it's so because the bus is subsidized by the government. Whatever the reason, you go ahead and use the cheaper bus.

Likewise, as a programmer, if you're handed a platform where floating point is not more expensive or even cheaper than int8, it is perhaps wise to use floating point everywhere. The only things to note are, the vendor could have given you better int8 performance; and, at some point, a platform might emerge that you want to target and where int8 is much more efficient than float.

The upshot is that it's possible to lower the price of floating point relative to int8, but not the cost.

What's more "important" - prices or costs?

Prices have many nice properties that real costs don't have. For instance, all prices can be compared - just convert them all to your currency of choice. Real costs are hard to compare without prices: is 2x less time for 3x more energy better or worse?

In any discussion about "fundamental real costs", there tend to be hidden assumptions about prices. For example, I chose to ignore area in this discussion under the assumption that area is usually less important than power. What makes this assumption true - or false - is the prices fabs charge for silicon production, the sort of cooling solutions that are marketable today (a desktop fan could be used to cool a cell phone but you couldn't sell that phone), etc. It's really hard to separate costs from prices.

Here's a computer architect's argument to the effect of "look at prices, not costs":

While technical metrics like performance, power, and programmer effort make up for nice fuzzy debates, it is pivotal for every computer guy to understand that “Dollar” is the one metric that rules them all. The other metrics are just sub-metrics derived from the dollar: Performance matters because that’s what customers pay for; power matters because it allows OEMs to put cheaper, smaller batteries and reduce people’s electricity bills; and programmer effort matters because it reduces the cost of making software.

I have two objections: that prices are the effect, not the cause, and that prices are too volatile to commit to memory as a "fundamental".

Prices are the effect in the sense that, customers pay for performance because it matters, not "performance matters because customers pay for it". Or, more precisely - customers pay for performance because it matters to them. As a result - because customers pay for it - performance matters to vendors. Ultimately, the first cause is that performance matters, not that it sells.

The other thing about prices is that they're rather jittery. Even a price index designed for stability such as S&P 500 is jumping up and down like crazy. In a changing world, knowledge about costs has a longer shelf life than knowledge about prices.

For instance, power is considered cheap for desktops but expensive for servers and really expensive for mobile devices. In reality, desktops likely consume more power than servers, there being more desktops than servers. So the real costs are not like the prices - and prices change; the rise of mobile computing means rising prices for power-hungry architectures.

It seems to me that, taking the long view, the following makes sense:

  • It's best to reason in costs and project them to the relevant prices - not forget the underlying costs and "think in prices", so as to not get into habits that will become outdated when prices change.
  • If you see a high real cost "hidden" by contemporary prices, it's a good bet to assume that at some point in the future, prices will shift so that the real cost will rear its ugly head.

For example, any RISC architecture - ARM, MIPS, PowerPC, etc. - is fundamentally cheaper than, specifically, x86, in at least two ways: hardware costs - area & power - and the costs of developing said hardware. [At least so I believe; let's say that it's not as significant in my view than my other more basic examples, and I might be wrong and I'm only using this as an illustration.]

In the long run, this spells doom for the x86, whatever momentum it otherwise has at any point in time - software compatibility costs, Intel's manufacturing capabilities vs competitors capabilities, etc. Mathematically or physically fundamental costs will, in the long run, trump everything else.

In the long run, there is no x86, no ARM, no Windows, no iPhone, etc. There are just ideas. We remember ideas originating in ancient Greece and Rome, but no products. Every product is eventually outsold by another product. Old software is forgotten and old fabs rot. But fundamentals are forever. An idea that is sufficiently more costly fundamentally than a competing idea can not survive.

This is why I disagree with the following quote by Bob Colwell - the chief architect of the Pentium Pro (BTW, I love the interview and intend to publish a summary of the entire 160-something page document):

…you might say that CISC only stayed viable because Intel was able to throw a lot of money and people at it, and die size, bigger chips and so on.

In that sense, RISC still was better, which is what was claimed all along. And I said you know, there's point to be made there. I agree with you that Intel had more to do to stay competitive. They were starting a race from far behind the start line. But if you can throw money at a problem then, it's not really so fundamental technologically, is it? We look for more deep things than that, so if all the RISC/CISC thing amounted to was, you had a slight advantage economically, well, that's not as profound as it seemed back in the 80s was it?

Well, here's my counter-argument and it's not technical. The technical argument would be, CISC is worse, to the point where Intel's 32nm Medfield performs about as well as ARM-based 40nm chips in a space where power matters. Which can be countered with an economical argument - so what, Intel does have a better manufacturing ability so who cares, they still compete.

But my non-technical argument is, sure, you can be extremely savvy business-wise, and perhaps, if Intel realized early on how big mobile is going to be, they'd make a good enough x86-based offering back then and then everyone would have been locked out due to software compatibility issues and they'd reign like they reign in the desktop market.

But you can't do that forever. Every company is going to lose to some company at some point or other because you only need one big mistake and you'll make it, you'll ignore a single emerging market and that will be the end. Or, someone will outperform you technically - build a better fab, etc. If an idea is only ("only"?) being dragged into the future kicking and screaming by a very business-savvy and technically excellent company, then the idea has no chance.

The idea that will win is the idea that every new product will use. New products always beat old products - always have.

And nobody, nobody at all has made a new CISC architecture in ages. Intel will lose to a company or companies making RISC CPUs because nobody makes anything else - and it has to lose to someone. Right now it seems like it's ARM but it doesn't matter how it comes out in this round. It will happen at some point or other.

And if ARM beats x86, it won't be, straightforwardly, "because RISC is better" - x86 will have lost for business reasons, and it could have gone the other way for business reasons. But the fact that it will have lost to a RISC - that will be because RISC is technically better. That's why there's no CISC competitor to lose to.

Or, if you dismiss this with the sensible "in the long run, we're all dead" - then, well, if you're alive right now and you're designing hardware, you are not making a CISC processor, are you? QED, not?

Getting back to our subject - based on the assumption that real costs matter, I believe that ugly, specialized hardware is forever. It doesn't matter how much money is poured into general-purpose computing, by whom and why. You will always have sufficiently important tasks that can be accomplished 10x or 100x more cheaply by using fundamentally cheap operations, and it will pay off for someone to make the ugly hardware and write the ugly, low-level code doing low-precision arithmetic to make it work.

And, on the other hand, the market for general-purpose hardware is always going to be huge, in particular, because there are so many things that must be done where specialization fundamentally doesn't help at all.

Conclusion

Hardware can only deliver "efficiency miracles" for operations that are fundamentally cheap to begin with. This is done by lowering dispatching costs and so increasing throughput per unit of energy. The price paid is reduced flexibility.

Some operations, such as high-precision arithmetic and memory access, are fundamentally expensive in terms of energy consumed to reach a given throughput. With these, hardware can still give you more speed through parallelization, but at an energy cost that may be prohibitive.

Hardware macroarchitecture vs mircoarchitecture

The comp-arch.net wiki defines "computer architecture" as the union of two things:

  • Macroarchitecture - the "visible" parts, the contract between hardware and software. For example, branch instructions.
  • Microarchitecture - the "invisible" parts, the implementation strategy which affects performance but not semantics. For example, branch prediction.

I think this distinction is very interesting for three reasons:

  1. It pops up everywhere. That is, many hardware problems can be addressed at the macro level or the micro level, explicitly or implicitly.
  2. The choice of macro vs micro is rarely trivial - for most problems, there are real-world examples of both kinds of solutions.
  3. The choice has common consequences across problems. The benefits and drawbacks of macro and micro are frequently similar.

I'll use examples from diverse types of hardware - CPUs, DSPs, GPUs, FPGAs, CAPPs, and even DRAM controllers. We'll discuss some example problems and how they can be solved at the macro or micro level. I'll leave the discussion of the resulting trade-offs to separate write-ups. Here, we'll go through examples to see how practical macro and micro solutions to different problems look like.

Our examples are:

  • Data parallelism: SIMD vs SIMT
  • Multiple issue: VLIW vs superscalar
  • Running ahead: exposed latencies vs OOO
  • Local storage: bare RAM vs cache
  • Streaming access: DMA vs hardware prefetchers
  • Data processing: logic synthesis vs instruction sets
  • Local communication: routing vs register addresses
  • Avoiding starvation: pressure signals vs request aging

Data parallelism: SIMD vs SIMT

Suppose you want to have a data-parallel machine: software issues one instruction that processes multiple data items.

The common macro approach is wide registers and SIMD opcodes. To use the feature, software must explicitly break up its data into 16-byte chunks, and use special opcodes like "add_16_bytes" to process the chunks.

One mirco approach is what NVIDIA marketing calls SIMT. The instruction set remains scalar. However, hw runs multiple scalar threads at once, with simultaneously running threads all executing the same instruction. That way, 16 pairs of values are added in a single cycle using scalar instructions.

(If you're interested in SIMT, a detailed comparison with SIMD as well as SMT - the more general simultaneous multithreading model - is here.)

Multiple issue: VLIW vs superscalar/OOO

Suppose you want to have a multiple issue machine. You want to simultaneously issue multiple instructions from a single thread.

The macro approach is VLIW, which stands for "very long instruction word". The idea is, those multiple instructions you issue become "one (very long) instruction", because software explicitly asks to run them together: "ADD R0, R1, R2 and MUL R3, R0, R5". Note that ADD and MUL "see" the same value of R0: MUL gets R0's value before it's modified by ADD.

VLIW also lets software choose to say, "ADD R0, R1, R2; afterwards, MUL R3, R0, R5" - that's two separate instructions yielding vanilla serial execution. This is not only slower (2 cycles instead of 1), but has a different meaning. This way, MUL does see ADD's change to R0. Either way, you get what you explicitly asked for.

(If you're interested in VLIW, an explanation of how programs map to this rather strangely-looking architecture is here.)

The micro approach, called superscalar execution, is having the hardware analyze the instructions and run them in parallel - when that doesn't change the hw/sw contract (the serial semantics). For example, ADD R0, R1, R2 can run in parallel with MUL R3, R1, R2 - but not with MUL R3, R0, R5 where MUL's input, R0, depends on ADD. Software remains unaware of instruction-level parallelism - or at least it can remain unaware and still run correctly.

Running ahead: exposed latencies vs OOO

We've just discussed issuing multiple instructions simultaneously. A related topic is issuing instructions before a previous instruction completes. Here, the macro approach is to, well, simply go ahead and issue instructions. It's the duty of software to make sure those instructions don't depend on results that are not yet available.

For example, a LOAD instruction can have a 4-cycle latency. Then if you load to R0 from R1 and at the next cycle, add R0 and R2, you will have used the old value of R0. If you want the new value, you must explicitly wait for 4 cycles, hopefully issuing some other useful instructions in the meanwhile.

The micro approach to handling latency is called OOO (out-of-order execution). Suppose you load to R0, then add R0 and R2, and then multiply R3 and R4. An OOO processor will notice that the addition's input is not yet available, proceed to the multiplication because its inputs are ready, and execute the addition once R0 is loaded (in our example, after 4 cycles). The hw/sw contract is unaffected by the fact that hardware issues instructions before a previous instruction completes.

Local storage: local memories vs caches

Suppose you want to have some RAM local to your processor, so that much of the memory operations work with this fast RAM and not the external RAM, which is increasingly becoming a bottleneck.

The macro approach is, you just add local RAM. There can be special load/store opcodes to access this RAM, or a special address range mapped to it. Either way, when software wants to use local RAM, it must explicitly ask for it - as in, char* p = (char*)0×54000, which says, "I'll use a pointer pointing to this magic address, 0×54000, which is the base address of my local RAM".

This is done on many embedded DSPs and even CPUs - for example, ARM calls this "tightly-coupled memory" and MIPS calls this "scratch pad memory".

The micro approach is caches. Software doesn't ask to use a cache - it loads from an external address as if the cache didn't exist. It's up to hardware to:

  • Check if the data is already in the cache
  • If it isn't, load it to the cache
  • Decide which cached data will be overwritten with the new data
  • If the overwritten cached data was modified, write it back to external memory before "forgetting" it

The hardware changes greatly, the hw/sw contract does not.

Data streaming: DMA vs hardware prefetchers

Suppose you want to support efficient "streaming transfers". DRAM is actually a fairly poor random access memory - there's a big latency you pay per address. However, it has excellent throughput if you load a large contiguous chunk of data. To utilize this, a processor must issue loads without waiting for results of previous loads. Load, wait, load, wait… is slow; load, load, load, load… is fast.

The macro approach is, sw tells hw that it wants to load an array. For example, a DMA - direct memory access - engine can have control registers telling it the base address and the size of an array to load. Software explicitly programs these registers and says, "load".

DMA starts loading and eventually says, "done" - for example, by setting a bit. In the meanwhile, sw does some unrelated stuff until it needs the loaded data. At this point, sw waits until the "done" bit is set, and then uses the data.

The micro approach is, software simply loads the array "as usual". Naturally, it loads from the base address, p, then from p+1, then p+2, p+3, etc. At some point, a hardware prefetcher quietly inspecting all the loads realizes that a contiguous array is being loaded. It then speculatively fetches ahead - loads large chunks beyond p+3 (hopefully not too large - we don't want to load too much unneeded data past the end of our array).

When software is about to ask for, say, p+7, its request is suddenly satisfied very quickly because the data is already in the cache. This keeps working nicely with p+8 and so on.

Data processing: logic synthesis vs instruction sets

Let's get back to basics. Suppose we want to add a bunch of numbers. How does software tell hardware to add numbers?

The micro approach is so much more common that it's the only one that springs to mind. Why, of course hardware has an ADD command, and it's implemented in hardware by some sort of circuit. There are degrees here (should there be a DIV opcode or should sw do division?) But the upshot is, there are opcodes.

However, there are architectures where software explicitly constructs data processing operations out of bit-level boolean primitives. This is famously done on FPGAs and is called "logic synthesis" - effectively software gets to build its own circuits. (This programming model is so uncommon that it isn't even called "software", but it certainly is.)

Less famously, this is also what effectively happens on associative memory processors (CAPPs/APAs) - addition is implemented as a series of bit-level masked compare & write operations. (The CAPP way results in awfully long latencies, which you're supposed to make up with throughput by processing thousands of elements in parallel. If you're interested in CAPPs, an overview is available here.)

Of course, you can simulate multiplication using bitwise operations on conventional opcode-based processors. But that would leave much of the hardware unused. On FPGAs and CAPPs, on the other hand, "building things out of bits" is how you're supposed to utilize hardware resources. You get a big heap of exposed computational primitives, and you map your design to them.

Local communication: routing vs register addresses

Another problem as basic as data processing operations is local communication: how does an operation pass its results to the next? We multiply and then add - how does the addition get the output of multiplication?

Again, the micro approach is by far the better known one. The idea is, you have registers, which are numbered somehow. We ask MUL to output to the register R5 (encoded as, say, 5). Then we ask ADD to use R5 as an input.

This actually doesn't sound as "micro" - what's implicit about it? We asked for R5 very explicitly. However, there are two sorts of "implicit" things going on here:

  • The numbers don't necessarily refer to physical registers - they don't on machines with register renaming.
  • More fundamentally, even when numbers do refer to physical registers, the routing is implicit.

How does the output of MUL travel to R5 and then to the input port of the adder? There are wires connecting these things, and multiplexers selecting between the various options. On most machines, there are also data forwarding mechanisms sending the output of MUL directly to the adder, in parallel to writing it into R5, so that ADD doesn't have to wait until R5 is written and then read back. But even on machines with explicit forwarding (and there aren't many), software doesn't see the wires and muxes - these are opaque hardware resources.

The macro approach to routing is what FPGAs do. The various computational units are connected to nearby configurable switches. By properly configuring those switches, you can send the output of one unit to another using a path going through several switches.

Of course this uses up the wires connecting the switches, and longer paths result in larger latencies. So it's not easy to efficiently connect all the units that you want using the switches and the wires that you have. In FPGAs, mapping operations to computational units and connecting between them is called "placement and routing". The "place & route" tools can run for a couple of hours given a large design.

This example as well as the previous illustrate micro vs macro at the extreme - a hardware resource that looks "all-important" in one architecture is invisible in another to the point where we forget it exists. The point is that they're equally important on both - the only question is who manages the resource, hardware or software.

Avoiding starvation: pressure signals vs request aging

One thing DRAM controllers do is accept requests from several different processors, put them in a queue, and reorder them. Reordering helps to better utilize DRAM, which, as previously mentioned, isn't that good at random access and prefers streaming access to consequent locations.

So if two processors, A and B, run in parallel, and each accesses a different region, it's frequently better to group requests together - A, A, A, A, B, B, B, B - then to process them in the order in which they arrive - say, A, B, A, A, B, A, B, B.

In fact, as long as A keeps issuing requests, it's frequently better to keep processing them until they're over, and keep B waiting. Better, that is, for throughput, as well as for A's latency - but worse for B's latency. If we don't know when to stop, serving A and starving B could make the system unusable.

When to stop? One macro solution is, the DRAM controller has incoming pressure signals, and both A and B can complain when starved by raising the pressure. Actually, this is "macro" only as far as the DRAM controller is concerned - it gives outside components explicit control over its behavior. The extent of software control over the generation of the pressure signal depends on the processors A and B.

One micro solution is to use request aging. Older requests are automatically considered more urgent. This method is implemented in many DRAM controllers - for instance, Denali's. The macro approach is implemented in the Arteris DRAM scheduler.

The micro approach is safer - the controller itself is careful to prevent starvation, whereas in the macro option, a non-cooperative processor can starve others. It also uses a simpler bus protocol, making compatibility easier for processors. However, it results in a lesser throughput - for instance, if B is a peripheral device with a large FIFO for incoming data, and can afford to wait for long periods of time until the FIFO overflows.

Whatever the benefits and drawbacks - and here, we aren't going to discuss benefits and drawbacks in any depth - this last example is supposed to illustrate that macro vs micro is relevant outside of "core"/"processor" design but extends to "non-computational" hardware as well.

Blurred boundaries

Micro vs macro is more of a continuum than a strictly binary distinction. That is, we can't always label a hardware feature as "visible" or "invisible" to programmers - rather, we can talk about the extent of its visibility.

There are basically two cases of "boundary blurring":

  • Hardware features "quite visible" even though they don't affect program semantics. These are "technically micro" but "macro in spirit".
  • Hardware features "quite invisible" even though they do affect program semantics. These are "technically macro" but "micro in spirit".

Let's briefly look at examples of both kinds of "blurring".

Technically micro but macro in spirit

A good example is memory banking. The point of banking is increasing the number of addresses that can be accessed per cycle. A single 32K memory bank lets you access a single address per cycle. 2 16K banks let you access 2 address, 4 8K banks let you access 4 addresses, and so on.

So basically "more is better". What limits the number of banks is the overhead you pay per bank, the overhead of logic figuring out the bank an address belongs to, and the fact that there's no point in accessing more data than you can process.

Now if we look at banking as implemented in NVIDIA GPU memory, TI DSP caches and superscalar CPU caches, then at first glance, they're all "micro solutions". These machines seem to mostly differ in their mapping of address to bank - for instance, NVIDIA GPUs switch banks every 4 bytes, while TI DSPs switch banks every few kilobytes.

But on all these machines, software can remain unaware of banking and run correctly. If two addresses are accessed at the same cycle that map to the same bank, then the access will take two cycles instead of one - but no fault is reported and results aren't affected. Semantically, banking is invisible.

However, I'd call GPUs' and DSPs' banking "macroish", and superscalar CPUs' banking "microish". Why?

GPUs and DSPs "advertise" banking, and commit to a consistent address mapping scheme and consistent performance implications across different devices. Vendors encourage you to know about banking so that you allocate data in ways minimizing contentions.

CPUs don't advertise banking very much, and different CPUs running the same instruction set have different banking schemes which result in different performance. Moreover, those CPU variants differ in their ability to access multiple addresses in parallel in the first place: a low-end CPU might access at most one address but a high-end CPU might access two.

GPUs and DSPs, on the other hand, have explicit multiple load-store units (a macro feature). So software knows when it attempts to accesses many addresses in parallel - one reason to "advertise" which addresses can actually be accessed in parallel.

This shows why hardware features that don't affect program semantics aren't "completely invisible to programmers" - rather, there are "degrees of visibility". A feature only affecting performance is "quite visible" if vendors and users consider it an important part of the hw/sw contract.

Technically macro but micro in spirit

SIMD and VLIW are both visible in assembly programs/binary instruction streams. However, SIMD is "much more macro in spirit" than VLIW. That's because for many programmers, the hw/sw contract isn't the semantics of assembly, which they never touch, but the semantics of their source language.

At the source code level, the effect of SIMD tends to be very visible. Automatic vectorization rarely works, so you end up using intrinsic functions and short vector data types. The effect of VLIW on source code can be close to zero. Compilers are great at automatic scheduling, and better than humans, so there's no reason to litter the code with any sort of annotations to help them. Hence, SIMD is "more macro" - more visible.

Moreover, there's "macroish VLIW" and "microish VLIW" - just like there's "macroish banking" and "microish banking" - and, again, the difference isn't in the hardware feature itself, but in the way it's treated by vendors and users.

An extreme example of "microish VLIW" is Transmeta - the native binary instruction encoding was VLIW, but the only software that was supposed to be aware of that were the vendor-supplied binary translators from x86 or other bytecode formats. VLIW was visible at the hardware level but still completely hidden from programmers by software tools.

An opposite, "macro-centric" example is TI's C6000 family. There's not one, but two "human-writable assembly languages". There's parallel assembly, where you get to manually schedule instructions. There's also linear assembly, which schedules instructions for you, but you still get to explicitly say which execution unit each instruction will use (well, almost; let's ignore the A-side/B-side issues here.)

Why provide such a "linear assembly" language? Josh Fisher, the inventor of VLIW, didn't approve of the concept in his book "Embedded Computing: a VLIW Approach".

That's because originally, one of the supposed benefits of VLIW was precisely being "micro in spirit" - the ability to hide VLIW behind an optimizing compiler meant that you could speed up existing code just by recompiling it. Not as easy as simply running old binaries on a stronger new out-of-order processor, but easy enough in many cases - and much easier to support at the hardware end.

Linear assembly basically throws these benefits out the window. You spell things in terms of C6000's execution units and opcodes, so the code can't be cross-platform. Worse, TI can't decide to add or remove execution units or registers from some of the C6000 variants and let the compiler reschedule instructions to fit the new variant. Linear assembly refers to units and registers explicitly enough to not support this flexibility - for instance, there's no silent spill code generation. Remove some of the resources, and much of the code will stop compiling.

Then why is linear assembly shipped by TI, and often recommended as the best source language for optimized code? The reason is that the code is more "readable" - if one of the things the reader is after is performance implications. The same silent spill code generation that makes C more portable makes it "less readable", performance-wise - you never can tell whether your data fits into registers or not, similarly it's hard to know how many operations of every execution unit are used.

The beauty of linear assembly is that it hides the stuff humans particularly hate to do and compilers excel at - such as register allocation and instruction scheduling - but it doesn't hide things making it easy to estimate performance - such as instruction selection and the distinction between stack and register variables. (In my opinion, the only problem with linear assembly is that it still hides a bit too much - and that people can choose to not use it. They often do - and preserve stunning unawareness of how the C6000 works for years and years.)

Personally, I believe that, contrary to original expectations, VLIW works better in "macro-centric" platforms than "micro-centric" - a viewpoint consistent with the relative success of Transmeta's chips and VLIW DSPs. Whether this view is right or wrong, the point here is that hardware features "visible" to software can be more or less visible to programmers - depending on how visible the software stack in its entirety makes them.

Implications

We've seen that "macro vs micro" is a trade-off appearing in a great many contexts in hardware design, and that typically, both types of solutions can be found in practical architectures - so it's not clear which is "better".

If there's no clear winner, what are the benefits and drawbacks of these two options? I believe that these benefits and drawbacks are similar across the many contexts where the trade-off occurs. Some of the implications were briefly mentioned in the discussion on VLIW's "extent of visibility" - roughly:

  • Micro is more compatible
  • Macro is more efficient
  • Macro is easier to implement

There are other common implications - for example, macro is harder to context-switch (I like this one because, while it's not very surprising once you think about it, it doesn't immediately spring to mind).

I plan to discuss the implications in detail sometime soon. I intend to focus, not as much on how things could be in theory, but on how they actually tend to come out and why.

SIMD < SIMT < SMT: parallelism in NVIDIA GPUs

Programmable NVIDIA GPUs are very inspiring to hardware geeks, proving that processors with an original, incompatible programming model can become widely used.

NVIDIA call their parallel programming model SIMT - "Single Instruction, Multiple Threads". Two other different, but related parallel programming models are SIMD - "Single Instruction, Multiple Data", and SMT - "Simultaneous Multithreading". Each model exploits a different source of parallelism:

  • In SIMD, elements of short vectors are processed in parallel.
  • In SMT, instructions of several threads are run in parallel.
  • SIMT is somewhere in between - an interesting hybrid between vector processing and hardware threading.

My presentation of SIMT is focused on hardware architecture and its implications on the trade-off between flexibility and efficiency. I'll describe how SIMT is different from SIMD and SMT, and why - what is gained (and lost) through these differences.

From a hardware design perspective, NVIDIA GPUs are at first glance really strange. The question I'll try to answer is "why would you want to build a processor that way?" I won't attempt to write a GPU programming tutorial, or quantitatively compare GPUs to other processors.

SIMD < SIMT < SMT

It can be said that SIMT is a more flexible SIMD, and SMT is in turn a more flexible SIMT. Less flexible models are generally more efficient - except when their lack of flexibility makes them useless for the task.

So in terms of flexibility, SIMD < SIMT < SMT. In terms of performance, SIMD > SIMT > SMT, but only when the models in question are flexible enough for your workload.

SIMT vs SIMD

SIMT and SIMD both approach parallelism through broadcasting the same instruction to multiple execution units. This way, you replicate the execution units, but they all share the same fetch/decode hardware.

If so, what's the difference between "single instruction, multiple data", and single instruction, multiple threads"? In NVIDIA's model, there are 3 key features that SIMD doesn't have:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

We'll see how this lifts restrictions on the set of programs that are possible to parallelize, and at what cost.

Single instruction, multiple register sets

Suppose you want to add two vectors of numbers. There are many ways to spell this. C uses a loop spelling:

for(i=0;i<n;++i) a[i]=b[i]+c[i];

Matlab uses a vector spelling:

a=b+c;

SIMD uses a "short vector" spelling - the worst of both worlds. You break your data into short vectors, and your loop processes them using instructions with ugly names. An example using C intrinsic functions mapping to ARM NEON SIMD instructions:

void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) {
  for(int i=0; i<n; i+=4) {
    //compute c[i], c[i+1], c[i+2], c[i+3]
    uint32×4_t a4 = vld1q_u32(a+i);
    uint32×4_t b4 = vld1q_u32(b+i);
    uint32×4_t c4 = vaddq_u32(a4,b4);
    vst1q_u32(c+i,c4);
  }
}

SIMT uses a "scalar" spelling:

__global__ void add(float *a, float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i]=b[i]+c[i]; //no loop!
}

The weird __global__ keyword says that add() is a GPU thread entry point. blockIdx, blockDim and threadIdx are built-in thread-local variables keeping the thread's ID. We'll see later why a thread ID isn't just a single number; however, in this example we in fact convert it to a single number, and use it as the element index.

The idea is that the CPU spawns a thread per element, and the GPU then executes those threads. Not all of the thousands or millions of threads actually run in parallel, but many do. Specifically, an NVIDIA GPU contains several largely independent processors called "Streaming Multiprocessors" (SMs), each SM hosts several "cores", and each "core" runs a thread. For instance, Fermi has up to 16 SMs with 32 cores per SM - so up to 512 threads can run in parallel.

All threads running on the cores of an SM at a given cycle are executing the same instruction - hence Single Instruction, Multiple Threads. However, each thread has its own registers, so these instructions process different data.

Benefits

"Scalar spelling", where you write the code of a single thread using standard arithmetic operators, is arguably a better interface than SIMD loops with ugly assembly-like opcodes.

Syntax considerations aside, is this spelling more expressive - can it do things SIMD can't? Not by itself, but it dovetails nicely with other features that do make SIMT more expressive. We'll discuss those features shortly; in theory, they could be bolted on the SIMD model, but they never are.

Costs

From a hardware resources perspective, there are two costs to the SIMT way:

  • Registers spent to keep redundant data items. SIMT registersIn our example, the pointers a, b, and c have the same value in all threads. The values of i are different across threads, but in a trivial way - for instance, 128, 129, 130… In SIMD, a, b, and c would be kept once in “scalar” registers - only short vectors such as a[i:i+4] would be kept in “vector” registers. The index i would also be kept once - several neighbor elements starting from i would be accessed without actually computing i+1, i+2, etc. Redundant computations both waste registers and needlessly consume power. Note, however, that a combination of compiler & hardware optimizations could eliminate the physical replication of redundant values. I don't know the extent to which it's done in reality.
  • Narrow data types are as costly as wide data types. SIMD registersA SIMD vector register keeping 4 32b integers can typically also be used to keep 8 16b integers, or 16 8b ones. Similarly, the same ALU hardware can be used for many narrow additions or fewer wide ones - so 16 byte pairs can be added in one cycle, or 4 32b integer pairs. In SIMT, a thread adds two items at a time, no matter what their width, wasting register bits and ALU circuitry.

It should be noted that SIMT can be easily amended with SIMD extensions for narrow types, so that each thread processes 4 bytes at a time using ugly assembly-like opcodes. AFAIK, NVIDIA refrains from this, presumably assuming that the ugliness is not worth the gain, with 32b float being the most popular data type in graphics anyway.

Single instruction, multiple addresses

Let's apply a function approximated by a look-up table to the elements of a vector:

__global__ void apply(short* a, short* b, short* lut) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i] = lut[b[i]]; //indirect memory access
}

Here, i is again "redundant" in the sense that in parallel threads, the values of i are consecutive. However, each thread then accesses the address lut+b[i] - and these addresses are not consecutive.

Roughly, such parallel random access works for both loads and stores. Logically, stores are the trickier part because of conflicts. What if two or more threads attempt to, say, increment the same bin in a histogram? Different NVIDIA GPU generations provide different solutions that we won't dwell on.

Benefits

This feature lets you parallelize many programs you can't with SIMD. Some form of parallel random access is usually available on SIMD machines under the names "permutation", "shuffling", "table look-up", etc. However, it always works with registers, not with memory, so it's just not the same scale. You index into a table of 8, 16 or 32 elements, but not 16K.

As previously mentioned, in theory this feature can be bolted on the SIMD model: just compute your addresses (say, lut+b[i]) in vector registers, and add a rand_vec_load instruction. However, such an instruction would have a fairly high latency. As we'll see, the SIMT model naturally absorbs high latencies without hurting throughput; SIMD much less so.

Costs

GPU has many kinds of memory: external DRAM, L2 cache, texture memory, constant memory, shared memory… We'll discuss the cost of random access in the context of two memories "at the extremes": DRAM and shared memory. DRAM is the farthest from the GPU cores, sitting outside the chip. Shared memory is the closest to the cores - it's local to an SM, and the cores of an SM can use it to share results with each other, or for their own temporary data.

  • With DRAM memory, random access is never efficient. In fact, the GPU hardware looks at all memory addresses that the running threads want to access at a given cycle, and attempts to coalesce them into a single DRAM access - in case they are not random. Effectively the contiguous range from i to i+#threads is reverse-engineered from the explicitly computed i,i+1,i+2… - another cost of replicating the index in the first place. If the indexes are in fact random and can not be coalesced, the performance loss depends on "the degree of randomness". This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it - similarly to any other processor.
  • With shared memory, random access is slowed down by bank contentions. Generally, a hardware memory module will only service one access at a time. So shared memory is organized in independent banks; the number of banks for NVIDIA GPUs is 16. If x is a variable in shared memory, then it resides in bank number (&x/4)%16. In other words, if you traverse an array, the bank you hit changes every 4 bytes. Access throughput peaks if all addresses are in different banks - hardware contention detection logic always costs latency, but only actual contentions cost throughput. If there's a bank hosting 2 of the addresses, the throughput is 1/2 of the peak; if there's a bank pointed by 3 addresses, the throughput is 1/3 of the peak, etc., the worst slowdown being 1/16.

SIMT bank contentions

In theory, different mappings between banks and addresses are possible, each with its own shortcomings. For instance, with NVIDIA's mapping, accessing a contiguous array of floats gives peak throughput, but a contiguous array of bytes gives 1/4 of the throughput (since banks change every 4 bytes). Many of the GPU programming tricks aim at avoiding contentions.

For instance, with a byte array, you can frequently work with bytes at distance 4 from each other at every given step. Instead of accessing a[i] in your code, you access a[i*4], a[i*4+1], a[i*4+2] and a[i*4+3] - more code, but less contentions.

This sounds convoluted, but it's a relatively cheap way for the hardware to provide efficient random access. It also supports some very complicated access patterns with good average efficiency - by handling the frequent case of few contentions quickly, and the infrequent case of many contentions correctly.

Single instruction, multiple flow paths

Let's find the indexes of non-zero elements in a vector. This time, each thread will work on several elements instead of just one:

__global__ void find(int* vec, int len,
                     int* ind, int* nfound,
                     int nthreads) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int last = 0;
  int* myind = ind + tid*len;
  for(int i=tid; i<len; i+=nthreads) {
    if(vec[i]) { //flow divergence
      myind[last] = i;
      last++;
    }
  }
  nfound[tid] = last;
}

Each thread processes len/nthreads elements, starting at the index equal to its ID with a step of nthreads. We could make each thread responsible for a more natural contiguous range using a step of 1. The way we did it is better in that accesses to vec[i] by concurrent threads address neighbor elements, so we get coalescing.

The interesting bit is if(vec[i]) - depending on vec[i], some threads execute the code saving the index, some don't. The control flow of different threads can thus diverge.

Benefits

Support for divergence further expands the set of parallelizable programs, especially when used together with parallel random access. SIMD has some support for conditional execution though vector "select" operations: select(flag,a,b) = if flag then a else b. However, select can't be used to suppress updates to values - the way myind[last] is never written by threads where vec[i] is 0.

SIMD instructions such as stores could be extended to suppress updates based on a boolean vector register. For this to be really useful, the machine probably also needs parallel random access (for instance, the find() example wouldn't work otherwise). Unless what seems like an unrealistically smart compiler arrives, this also gets more and more ugly, whereas the SIMT spelling remains clean and natural.

Costs

  • Only one flow path is executed at a time, and threads not running it must wait.SIMT divergence Ultimately SIMT executes a single instruction in all the multiple threads it runs - threads share program memory and fetch / decode / execute logic. When the threads have the same flow - all run the if, nobody runs the else, for example - then they just all run the code in if at full throughput. However, when one or more threads have to run the else, they'll wait for the if threads. When the if threads are done, they'll in turn wait for the else threads. Divergence is handled correctly, but slowly. Deeply nested control structures effectively serialize execution and are not recommended.
  • Divergence can further slow things down through "randomizing" memory access. In our example, all threads read vec[i], and the indexing is tweaked to avoid contentions. However, when myind[last] is written, different threads will have incremented the last counter a different number of times, depending on the flow. This might lead to contentions which also serialize execution to some extent. Whether the whole parallelization exercise is worth the trouble depends on the flow of the algorithm as well as the input data.

We've seen the differences between SIMT and its less flexible relative, SIMD. We'll now compare SIMT to SMT - the other related model, this time the more flexible one.

SIMT vs SMT

SIMT and SMT both use threads as a way to improve throughput despite high latencies. The problem they tackle is that any single thread can get stalled running a high-latency instruction. This leaves the processor with idle execution hardware.

One way around this is switching to another thread - which (hopefully) has an instruction ready to be executed - and then switch back. For this to work, context switching has to be instantaneous. To achieve that, you replicate register files so that each thread has its own registers, and they all share the same execution hardware.

But wait, doesn't SIMT already replicate registers, as a way to have a single instruction operate on different data items? It does - here, we're talking about a "second dimension" of register replication:

  1. Several threads - a "warp" in NVIDIA terminology - run simultaneously. So each thread needs its own registers.
  2. Several warps, making up a "block", are mapped to an SM, and an SM instantaneously switches between the warps of a block. So each warp needs separate registers for each of its threads.

SIMT 2D replication

With this "two-dimensional" replication, how many registers we end up with? Well, a lot. A block can have up to 512 threads. And the registers of those threads can keep up to 16K of data.

How many register sets does a typical SMT processor have? Er, 2, sometimes 4…

Why so few? One reason is diminishing returns. When you replicate registers, you pay a significant price, in the hope of being able to better occupy your execution hardware. However, with every thread you add, the chance of it being already occupied by all the other threads rises. Soon, the small throughput gain just isn't worth the price.

If SMT CPU designers stop at 2 or 4 threads, why did SIMT GPU designers go for 512?

With enough threads, high throughput is easy

SMT is an afterthought - an attempt to use idle time on a machine originally designed to not have a lot of idle time to begin with. The basic CPU design aims, first and foremost, to run a single thread fast. Splitting a process to several independent threads is not always possible. When it is possible, it's usually gnarly.

Even in server workloads, where there's naturally a lot of independent processing, single-threaded latency still matters. So few expensive, low-latency cores outperform many cheap, high-latency cores. As Google's Urs Hölzle put it, "brawny cores beat wimpy cores". Serial code has to run fast.

Running a single thread fast means being able to issue instructions from the current thread as often as possible. To do that, CPU hardware works around every one of the many reasons to wait. Such diverse techniques as:

  • superscalar execution
  • out-of-order execution
  • register renaming
  • branch prediction
  • speculative execution
  • cache hierarchy
  • speculative prefetching
  • etc. etc.

…are all there for the same basic purpose. They maximize the chances of an instruction to be issued without having to switch to another thread.

SMT is the last line of defense, attempting to fill stalls after all these other measures failed. And even that is considered a bad idea when it hurts the precious single-threaded performance. Which it usually does: each of the 2 threads will typically complete later then it would if it didn't have to share the hardware with the other. This is a key reason to keep the number of hardware threads low, even if there are still throughput gains to be made by adding threads.

However, for the GPUs, the use case is when you do have enough data parallelism to make use of plenty of threads. If so, why not build things the other way around? Threading could be our first stall-mitigating measure. If we have enough threads, we just keep switching between them, and the hardware always has something to do.

SIMT/SMT latency

This saves a lot of hardware, and a lot of design effort, because you don't need most of the other methods anymore. Even caches and hardware prefetching are not used in much of the GPU memory traffic - rather, you access external memory directly. Why bother with caching and prefetching, if you don't have to sit idly until the data arrives from main memory - but instead just switch to a different warp? No heuristics, no speculation, no hurry - just keep yourself busy when waiting.

Furthermore, even the basic arithmetic pipeline is designed for a high latency, high throughput scenario. According to the paper "Demystifying GPU architecture through microbenchmarking", no operation takes less than 24 cycles to complete. However, the throughput of many operations is single-cycle.

The upshot is that counting on the availability of many threads allows the GPU to sustain a high throughput without having to sweat for low latencies. Hardware becomes simpler and cheaper in many areas as a result.

When latencies are high, registers are cheap

So plentiful threads make it easier to build high-throughput hardware. What about having to replicate all those registers though? 16K sounds like an insane amount of registers - how is this even affordable?

Well, it depends on what "register" means. The original meaning is the kind of storage with the smallest access latency. In CPUs, access to registers is faster than access to L1 caches, which in turn is faster than L2, etc. Small access latency means an expensive implementation, therefore there must be few registers.

However, in GPUs, access to "registers" can be quite slow. Because the GPU is always switching between warps, many cycles pass between two subsequent instructions of one warp. The reason registers must be fast in CPUs is because subsequent instructions communicate through them. In GPUs, they also communicate through registers - but the higher latency means there's no rush.

Therefore, GPU registers are only analogous to CPU registers in terms of instruction encoding. In machine code, "registers" are a small set of temporary variables that can be referenced with just a few bits of encoding - unlike memory, where you need a longer address to refer to a variable. In this and other ways, "registers" and "memory" are semantically different elements of encoding - both on CPUs and GPUs.

However, in terms of hardware implementation, GPU registers are actually more like memory than CPU registers. [Disclaimer: NVIDIA doesn't disclose implementation details, and I'm grossly oversimplifying, ignoring things like data forwarding, multiple access ports, and synthesizable vs custom design]. 16K of local RAM is a perfectly affordable amount. So while in a CPU, registers have to be expensive, they can be cheap in a high-latency, high-throughput design.

It's still a waste if 512 threads keep the same values in some of their registers - such as array base pointers in our examples above. However, many of the registers keep different values in different threads. In many cases register replication is not a waste at all - any processor would have to keep those values somewhere. So functionally, the plentiful GPU registers can be seen as a sort of a data cache.

Drawbacks

We've seen that:

  • Many threads enable cheap high-throughput, high-latency design
  • A high-throughput, high-latency design in turn enables a cheap implementation of threads' registers

This leads to a surprising conclusion that SIMT with its massive threading can actually be cheaper than SMT-style threading added to a classic CPU design. Not unexpectedly, these cost savings come at a price of reduced flexibility:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

Occupancy

"Occupancy" is NVIDIA's term for the utilization of threading. The more threads an SM runs, the higher its occupancy. Low occupancy obviously leads to low performance - without enough warps to switch between, the GPU won't be able to hide its high latencies. The whole point of massive threading is refusing to target anything but massively parallel workloads. SMT requires much less parallelism to be efficient.

Divergence

We've seen that flow divergence is handled correctly, but inefficiently in SIMT. SMT doesn't have this problem - it works quite well given unrelated threads with unrelated control flow.

There are two reasons why unrelated threads can't work well with SIMT:

  • SIMD-style instruction broadcasting - unrelated threads within a warp can't run fast.
  • More massive threading than SMT - unrelated wraps would compete for shared resources such as instruction cache space. SMT also has this problem, but it's tolerable when you have few threads.

So both of SIMT's key ideas - SIMD-style instruction broadcasting and SMT-style massive threading - are incompatible with unrelated threads.

Related threads - those sharing code and some of the data - could work well with massive threading by itself despite divergence. It's instruction broadcasting that they fail to utilize, leaving execution hardware in idle state.

However, it seems that much of the time, related threads actually tend to have the same flow and no divergence. If this is true, a machine with massive threading but without instruction broadcasting would miss a lot of opportunities to execute its workload more efficiently.

Synchronization

In terms of programming model, SMT is an extension to a single-threaded, time-shared CPU. The same fairly rich set of inter-thread (and inter-device) synchronization and communication options is available with SMT as with "classic" single-threaded CPUs. This includes interrupts, message queues, events, semaphores, blocking and non-blocking system calls, etc. The underlying assumptions are:

  • There are quite many threads
  • Typically, each thread is doing something quite different from other threads
  • At any moment, most threads are waiting for an event, and a small subset can actually run

SMT stays within this basic time-sharing framework, adding an option to have more than one actually running threads. With SMT, as with a "classic" CPU, a thread will be very typically put "on hold" in order to wait for an event. This is implemented using context switching - saving registers to memory and, if a ready thread is found, restoring its registers from memory so that it can run.

SIMT doesn't like to put threads on hold, for several reasons:

  • Typically, there are many running, related threads. It would make the most sense to put them all on hold, so that another, unrelated, equally large group of threads can run. However, switching 16K of context is not affordable. In this sense, "registers" are expensive after all, even if they are actually memory.
  • SIMT performance depends greatly on there being many running threads. There's no point in supporting the case where most threads are waiting, because SIMT wouldn't run such workloads very well anyway. From the use case angle, a lot of waiting threads arise in "system"/"controller" kind of software, where threads wait for files, sockets, etc. SIMT is purely computational hardware that doesn't support such OS services. So the situation is both awkward for SIMT and shouldn't happen in its target workloads anyway.
  • Roughly, SIMT supports data parallelism - same code, different data. Data parallelism usually doesn't need complicated synchronization - all threads have to synchronize once a processing stage is done, and otherwise, they're independent. What requires complicated synchronization, where some threads run and some are put on hold due to data dependencies, is task parallelism - different code, different data. However, task parallelism implies divergence, and SIMT isn't good at that anyway - so why bother with complicated synchronization?

Therefore, SIMT roughly supports just one synchronization primitive - __syncthreads(). This creates a synchronization point for all the threads of a block. You know that if a thread runs code past this point, no thread runs code before this point. This way, threads can safely share results with each other. For instance, with matrix multiplication:

  1. Each thread, based on its ID - x,y - reads 2 elements, A(x,y) and B(x,y), from external memory to the on-chip shared memory. (Of course large A and B won't fit into shared memory, so this will be done block-wise.)
  2. Threads sync - all threads can now safely access all of A and B.
  3. Each thread, depending on the ID, multiplies row y in A by column x in B.
//1. load A(x,y) and B(x,y)
int x = threadIdx.x;
int y = threadIdx.y;
A[stride*y + x] = extA[ext_stride*y + x];
B[stride*y + x] = extB[ext_stride*y + x];
//2. sync
__syncthreads();
//3. multiply row y in A by column x in B
float prod = 0;
for(int i=0; i<N; ++i) {
  prod += A[stride*y + i] * B[stride*i + x];
}

It's an incomplete example (we look at just one block and ignore blockIdx, among other thing), but it shows the point of syncing - and the point of these weird "multi-dimensional" thread IDs (IDs are x,y,z coordinates rather than just integers). It's just natural, with 2D and 3D arrays, to map threads and blocks to coordinates and sub-ranges of these arrays.

Summary of differences between SIMD, SIMT and SMT

SIMT is more flexible in SIMD in three areas:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

SIMT is less flexible than SMT in three areas:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

The effect of flexibility on costs was discussed above. A wise programmer probably doesn't care about these costs - rather, he uses the most flexible (and easily accessible) device until he runs out of cycles, then moves on to utilize the next most flexible device. Costs are just a limit on how flexible a device that is available in a given situation can be.

"SIMT" should catch on

It's a beautiful idea, questioning many of the usual assumptions about hardware design, and arriving at an internally consistent answer with the different parts naturally complementing each other.

I don't know the history well enough to tell which parts are innovations by NVIDIA and which are borrowed from previous designs. However, I believe the term SIMT was coined by NVIDIA, and perhaps it's a shame that it (apparently) didn't catch, because the architecture "deserves a name" - not necessarily true of every "new paradigm" announced by marketing.

One person who took note is Andy Glew, one of Intel's P6 architects - in his awesome computer architecture wiki, as well as in his presentation regarding further development of the SIMT model.

The presentation talks about neat optimizations - "rejiggering threads", per-thread loop buffers/time pipelining - and generally praises SIMT superiority over SIMD. Some things I disagree with - such as the "vector lane crossing" issue - and some are very interesting, such as everything about improving utilization of divergent threads.

I think the presentation should be understandable after reading my oversimplified overview - and will show where my overview oversimplifies, among other things.

Peddling fairness

Throughout this overview, there's this recurring "fairness" idea: you can trade flexibility for performance, and you can choose your trade-off.

It makes for a good narrative, but I don't necessarily believe it. You might very well be able to get both flexibility and performance. More generally, you might get both A and B, where A vs B intuitively feels like "an inherent trade-off".

What you can't do is get A and B in a reasonably simple, straightforward way. This means that you can trade simplicity for almost everything - though you don't necessarily want to; perhaps it's a good subject for a separate post.

An unusual hardware architecture: APA (Associative Processing Array)

We're living in the golden age of hardware design - it probably won't get any cheaper to make a new chip. Yes, there's the downside that it's harder to be wildly incompatible with everything than in the old days - there's plenty of standard interfaces to support. On the other hand, it's a benefit, not just a drawback - make a chip with a CPU that runs C, an Ethernet controller and a DRAM interface, and it's now usable with plenty of software and hardware developed by others.

And then for the wilder, "incompatible" innovations - if you really want to invent interfaces and not just implementations - there's the "accelerator" realm: an MPEG decoder, a DSP, a GPU. A decade ago, they said that "standard" microprocessors killed all high-performance architectures. Today, a system that kept the "fastest computer in the world" title for almost a year is based on GPUs - and "non-standard"/"incompatible" accelerators are everywhere (not to mention SIMD instruction set extensions right inside the otherwise "compatible"/"commodity" CPUs).

Still, while it indeed became way cheaper to make your own chips since you don't have to erect your own fab to do that, "cheap" still means costs somewhere in the millions - not quite what the word "cheap" brings to mind. And mistakes still can't be corrected, not really. Which results, among other things, in an understandable risk aversion with respect to design.

A language geek, who's naturally curious about programming languages that don't look mainstream (C with classes), is going to meet many such languages, and plenty of implementations to play with. A hardware geek, who's equally curious about designs that don't look mainstream (RISC/VLIW with bells and whistles), is going to find few such designs and very few implementations. Risk aversion is innovation aversion.

APA thus combines two traits that are rare - it gets at least 9 out of 10 in the "non-mainstream" category, and implementations were actually manufactured and shipped, specifically, in early smartphones by NeoMagic (BDTI and EETimes articles are some of the architecture overviews). The phones apparently weren't very successful, but it seems a poor measure of the architecture's merit in this case, just because there are so many ways to fail regardless of your accelerator quality. So the platform's demise gives us not evidence but a lack of evidence (the platform wasn't widely targeted by 3rd party developers whose opinions could be interesting).

Overview

APA stands for Associative Processing Array, a kind of content-addressable memory. Just memory. How do you compute using just memory? APA (Associative Processing Array)The operations are done right there, near the cells where your bits are kept. No need to read your data word by word into a processor, than write it back - rather, your operations run in parallel, inside the memory, on every data word! SIMD operations on steroids.

What operations? Bitwise operations - masked compare and masked write:

  • Compare: tag[i] = (word[i]&mask)==(compare_data&mask)
  • Write: if(tag[i]) word[i]=(word[i]&mask) | (write_data&~mask)

…where a tag bit is kept for every data word (row), and mask & compare/write data are broadcasted to all rows. You can also move words - either to the adjacent rows or to rows at distance 8 (or, say, 16, depending on the hardware configuration - but not at an arbitrary distance). And you can move the data in or out, sequentially.

Where's the code - who decides what to write, compare and move in what order? One option is to use a low-end CPU running "normal" code to control the contents of the compare/write data & mask registers, to issue APA operations and to interact with the host system (the "real" CPU and external memory). This little processor isn't "accelerating" computations by itself, just controls the APA accelerator hardware.

That's it. No add, no multiply, no nothing. How's that for non-mainstream?

So, how do you actually do anything with this SIMD-on-steroids machine - say, add two vectors of numbers? Well, you need to have both vectors in the array - each element pair occupying some of the bits of a row (for instance, the NeoMagic 512-row APA had 160 bits per row, so for 16 bit vectors, you'd use 32 bits out of 160). Then you perform the addition bit by bit, using masked compare and write operations.

That's right, addition done entirely in software. Latency: 48 cycles for 16 bits, throughput: >10 16b additions/cycle for NeoMagic's 512-row APA. More latency for 32 bits, less for 8 bits, still less for 3 bits. Which means optimization opportunities you're not used to having in software (you gain no speed-up using only 3 bits of a 32b CPU register for addition). On the other hand, it means large penalties for high precision (floating point takes thousands of cycles).

Tools and libraries would be supplied with an APA implementation, but the exciting thought for a programmer is to be able to bypass them where needed and get down to bit-level manipulation! (The exciting thought for a decision maker would be to always use the tools, never pay someone for the bit fiddling; well, different people get their excitement from different things.)

Pros & cons

In a way, it's an exceedingly generic and elegant architecture, and a very impressive one. One "gets it" like one doesn't get any other. Just try to describe any other sort of programmable machine to a level of detail sufficient for accurate predictions about performance.

Surprisingly - or, perhaps, unsurprisingly given just how unusual this machine is - I simultaneously also don't get it like any other. I'm not used to thinking of computations that way, both in terms of precision and in terms of data access. I have no idea what share of my 8-bit numbers only have just 5 or 2 bits and it can matter a lot here.

Likewise, even for algorithms that don't require random access - and APA is simply not good for random access - I have little idea of just how non-random my access is. That is, similarly to the precision case where constants I'm not used to care about matter, there's a difference between accessing a row at distance 1 and a row at distance 5, and I don't know how frequent different distances between adjacent things are in a code base I care about.

So while in the abstract, a shorter than ever architecture description is sufficient here for accurate performance predictions, I lack intuition and experience that would make such predictions easy.

A great thing about the APA design is its "hardware friendliness". The typical processor design involves a truckload of convoluted circuits that are likeable for their function, but unimaginable to the human mind as physical objects - so cumbersome, infinitely configurable, finicky tools are used to translate a functional description of said circuits to actual electric circuits occupying actual space.

The typical hardware design is not hardware friendly (sounds strange - but the average piece of software isn't very considerate towards its target processor and software stack, either). The APA, on the other hand, has a natural spatial mapping with its 2 dimensions (row bit width and the number of rows) and regular connections between just the adjacent or relatively close rows.

Again, surprisingly - or unsurprisingly given how unusual the APA is - this is also a drawback, in the sense that the standard hardware design tools will do a very poor job implementing APA. For that matter, standard tools will likely do an even worse job implementing plain RAM - which also has a natural spatial mapping and regular structure.

RAM is analog design, not digital: rather than describing it on a functional level, someone implements it as a physical circuit description. Basically, RAM is a library for digital tools, like flip-flops or simple gates - a library that can't be implemented using the tool itself.

I really appreciate the von Neumann architecture, but it's scary to realize just how strong the von Neumann grip really is. RAM is basically the only interface to a large array of bits that is relatively easily accessible to a hardware designer. (Not that it's very easily accessible, mind you - there is no portable interface for RAM, believe it or not, but no matter how you build your chips, you'll be able to get some sort of RAM).

What happens if you want your own bag of bits instead of RAM? You can do what they call full custom design - make your own physical circuit description. This isn't "portable", where "porting" means changing the manufacturing process - either to move from 65nm to 40nm, or from one manufacturer to another. It's also a longer and more complicated process. But it's certainly doable, especially if you outsource it to the inventors, as you'd have to do anyway because the thing is patented.

The hostility to hardware design tools (despite the friendliness to the basic constraints of hardware) and patent protection make APA more of a possible off-the-shelf solution than inspiring source of ideas, and so does the design simplicity. Unlike, say, VLIW, which is a design style with basic ideas in the public domain and countless possible variations and extensions (starting with the "let's add this one spiffy instruction" variety), APA is more or less complete. There are important constants to tweak - row width and the distance to easily accessible neighbors - but while it could be a hard choice to make, it's not very creative. (I do believe it could be a source of ideas if only through expanding one's horizons, which is why I write about it.)

If we attempt to discuss the efficiency of APA qualitatively, in terms of hardware resource utilization rather than throughput per mm^2 or per mW for a given app, three things come to mind:

  • Benefit - a perennial bottleneck of accessing memory through a bus very narrow compared to the amount of bits stored is eliminated. This "ought to be good" for algorithms where access is "far from random" since these gain nothing from conventional RAM's flexibility but pay the full price of its low throughput.
  • Drawback - the cycle is "wasted": values are read from flip-flops, undergo very few transformations and are written back. This "ought to be bad" because it won't translate to high frequency (you can't read then write a flip-flop very fast - and doing this with them all at once dissipates power, so in fact low frequencies enabled by parallelism, not high frequencies enabled by circuit simplicity are APA's potential advantage in the frequency department). A competing design running at a similar frequency, however, will do much more per cycle (like actual addition) because circuits implementing combinatorial logic are fast. The question thus becomes how big those circuits are compared to flip-flops: if enough APA rows can be packed instead, the throughput will still be competitive despite the abysmal latency - provided that you simultaneously process sufficiently large amounts of data.
  • Benefit - ability to use DRAM cells. This "ought to be good" since DRAM cells are much smaller than normal flip-flops (which is achieved through having them leak their charge so they have to be periodically refreshed). AFAIK, nobody uses them for computation directly because they don't fit in traditional hardware models - neither as registers (that's plain dumb) nor as local RAM (they have poor latency and imply a cumbersome controller to access as a RAM). An APA, on the other hand, could potentially run well on DRAM cells if the required simple circuitry were implemented near the cells. One problem with this is that custom chips may be easy to make these days, but not custom DRAM chips. In particular, Andy Glew, formerly from Intel then AMD, said at some place that it's "hard to influence DRAM makers", and if it's hard for Intel or AMD, well, it's probably hard in general.

Overall, I have a lot of reservations here - not just because of the way originality by itself seems to complicate matters here (I'd hate to admit it as the only reason…), but because it's a step in the opposite direction to what successful SIMD systems are doing. That is, you get high throughput given unusually low precision and unusually restricted data access patterns. A DSP from the late 90s would attempt to process numbers with more bits and would let you fetch them from more places. A GPU from the 2000s still more so - floating point numbers, parallel random access with transparent contention handling (at least in CUDA GPUs). It seems that the direction is removing restrictions on the set of things you run fast, not very high throughput for a very restricted set.

On the other hand, there exist algorithms with low precision and high locality. And it's exciting to see an architecture which is not RAM-based, naturally represented in space, etc. - all those things which get people excited in hardware discussions - but practical enough for a real world delivery, and with some things connecting it to more usual programming models (for instance, SIMD commands broadcasted from a CPU instead of clever local rules you have no idea how to come up with as in cellular automata). So it's definitely very interesting.

Update: as a better informed commenter pointed out, APA is also known as CAPP - Content Addressable Parallel Processor, and implementations date as early as 1972 (which makes you wonder what could legitimately remain outside the public domain by now). This seems an evidence of having failed the test of time - or having some really deep trouble with commercialization. I'd be curious to hear a software developer's experience with this - BDTI's article from 2003 talked about the development experience in hypothetical terms and entirely in the future tense, whereas some past evidence has to be available.

My history with Forth & stack machines

My VLSI tools take a chip from conception through testing. Perhaps 500 lines of source code. Cadence, Mentor Graphics do the same, more or less. With how much source/object code?

Chuck Moore, the inventor of Forth

This is a personal account of my experience implementing and using the Forth programming language and the stack machine architecture. "Implementing and using" - in that order, pretty much; a somewhat typical order, as will become apparent.

It will also become clear why, having defined the instruction set of a processor designed to run Forth that went into production, I don't consider myself a competent Forth programmer (now is the time to warn that my understanding of Forth is just that - my own understanding; wouldn't count on it too much.)

Why the epigraph about Chuck Moore's VLSI tools? Because Forth is very radical. Black Square kind of radical. An approach to programming seemingly leaving out most if not all of programming:

…Forth does it differently. There is no syntax, no redundancy, no typing. There are no errors that can be detected. …there are no parentheses. No indentation. No hooks, no compatibility. …No files. No operating system.

Black Square by Kazimir Malevich

I've never been a huge fan of suprematism or modernism in general. However, a particular modernist can easily get my attention if he's a genius in a traditional sense, with superpowers. Say, he memorizes note sheets upon the first brief glance like Shostakovich did.

Now, I've seen chip design tools by the likes of Cadence and Mentor Graphics. Astronomically costly licenses. Geological run times. And nobody quite knows what they do. To me, VLSI tools in 500 lines qualify as a superpower, enough to grab my attention.

So, here goes.

***

I was intrigued with Forth ever since I read about it in Bruce Eckel's book on C++, a 198-something edition; he said there that "extensibility got a bad reputation due to languages like Forth, where a programmer could change everything and effectively create a programming language of his own". WANT!

A couple of years later, I looked for info on the net, which seemed somewhat scarce. An unusually looking language. Parameters and results passed implicitly on a stack. 2 3 + instead of 2+3. Case-insensitive. Nothing about the extensibility business though.

I thought of nothing better than to dive into the source of an implementation, pForth - and I didn't need anything better, as my mind was immediately blown away by the following passage right at the top of system.fth, the part of pForth implemented in Forth on top of the C interpreter:

: (   41 word drop ; immediate
( That was the definition for the comment word. )
( Now we can add comments to what we are doing! )

Now. we. can. add. comments. to. what. we. are. doing.

What this does is define a word (Forth's name for a function) called "(". "(" is executed at compile time (as directed by IMMEDIATE). It tells the compiler to read bytes from the source file (that's what the word called, um, WORD is doing), until a ")" - ASCII 41 - is found. Those bytes are then ignored (the pointer to them is removed from the stack with DROP). So effectively, everything inside "( … )" becomes a comment.

Wow. Yeah, you definitely can't do that in C++. (You can in Lisp but they don't teach you those parts at school. They teach the pure functional parts, where you can't do things that you can in C++. Bastards.)

Read some more and…

\ conditional primitives
: IF     ( — f orig )  ?comp compile 0branch  conditional_key >mark     ; immediate
: THEN   ( f orig — )  swap ?condition  >resolve   ; immediate
: BEGIN  ( — f dest )  ?comp conditional_key <mark   ; immediate
: AGAIN  ( f dest — )  compile branch  swap ?condition  <resolve  ; immediate
: UNTIL  ( f dest — )  compile 0branch swap ?condition  <resolve  ; immediate
: AHEAD  ( — f orig )  compile branch   conditional_key >mark     ; immediate

Conditional primitives?! Looks like conditional primitives aren't - they define them here. This COMPILE BRANCH business modifies the code of a function that uses IF or THEN, at compile time. THEN - one part of the conditional - writes (RESOLVEs) a branch offset to a point in code saved (MARKed) by IF, the other part of the conditional.

It's as if a conventional program modified the assembly instructions generated from it at compile time. What? How? Who? How do I wrap my mind around this?

Shocked, I read the source of pForth.

Sort of understood how Forth code was represented and interpreted. Code is this array of "execution tokens" - function pointers, numbers and a few built-ins like branches, basically. A Forth interpreter keeps an instruction pointer into this array (ip), a data stack (ds), and a return stack (rs), and does this:

while(true) {
 switch(*ip) {
  //arithmetics (+,-,*…):
  case PLUS: ds.push(ds.pop() + ds.pop()); ++ip;
  //stack manipulation (drop,swap,rot…):
  case DROP: ds.pop(); ++ip;
  //literal numbers (1,2,3…):
  case LITERAL: ds.push(ip[1]); ip+=2;
  //control flow:
  case COND_BRANCH: if(!ds.pop()) ip+=ip[1]; else ip+=2;
  case RETURN: ip = rs.pop();
  //user-defined words: save return address & jump
  default: rs.push(ip+1); ip = *ip;
 }
}

That's it, pretty much. Similar, say, to the virtual stack machine used to implement Java. One difference is that compiling a Forth program is basically writing to the code array in a WYSIWYG fashion. COMPILE SOMETHING simply appends the address of the word SOMETHING to the end of the code array. So does plain SOMETHING when Forth is compiling rather than interpreting, as it is between a colon and a semicolon, that is, when a word is defined.

So

: DRAW-RECTANGLE 2DUP UP RIGHT DOWN LEFT ;

simply appends {&2dup,&up,&right,&down,&left,RETURN} to the code array. Very straightforward. There are no parameters or declaration/expression syntax as in…

void drawRectangle(int width, int height) {
  up(height);
  right(width);
  down(height);
  left(width);
}

…to make it less than absolutely clear how the source code maps to executable code. "C maps straightforwardly to assembly"? Ha! Forth maps straightforwardly to assembly. Well, to the assembly language of a virtual stack machine, but still. So one can understand how self-modifying code like IF and THEN works.

On the other hand, compared to drawRectangle, it is somewhat unclear what DRAW-RECTANGLE does. What are those 2 values on the top of the stack that 2DUP duplicates before meaningful English names appear in DRAW-RECTANGLE's definition? This is supposed to be ameliorated by stack comments:

: DRAW-RECTANGLE ( width height — ) … ;

…tells us that DRAW-RECTANGLE expects to find height at the top of the stack, and width right below it.

I went on to sort of understand CREATE/DOES> - a further extension of this compile-time self-modifying code business that you use to "define defining words" (say, CONSTANT, VARIABLE, or CLASS). The CREATE part says what should be done when words (say, class names) are defined by your new defining word. The DOES> part says what should be done when those words are used. For example:

: CONSTANT
   CREATE ,
   DOES> @
;
\ usage example:
7 CONSTANT DAYS-IN-WEEK
DAYS-IN-WEEK 2 + . \ should print 9

CREATE means that every time CONSTANT is called, a name is read from the source file (similarly to what WORD would have done). Then a new word is created with that name (as a colon would have done). This word records the value of HERE - something like sbrk(0), a pointer past the last allocated data item. When the word is executed, it pushes the saved address onto the data stack, then calls the code after DOES>. The code after CREATE can put some data after HERE, making it available later to the DOES> part.

With CONSTANT, the CREATE part just saves its input (in our example, 7) - the comma word does this: *HERE++ = ds.pop(); The DOES> part then fetches the saved number - the @ sign is the fetch word: ds.push( *ds.pop() );

CONSTANT works somewhat similarly to a class, CREATE defining its constructor and DOES> its single method:

class Constant
  def initialize(x) @x=x end
  def does() @x end
end
daysInWeek = Constant.new(7)
print daysInWeek.does() + 2

…But it's much more compact on all levels.

Another example is defining C-like structs. Stripped down to their bare essentials (and in Forth things tend to be stripped down to their bare essentials), you can say that:

struct Rectangle {
  int width;
  int height;
};

…simply gives 8 (the structure size) a new name Rectangle, and gives 0 and 4 (the members' offsets) new names, width and height. Here's one way to implement structs in Forth:

struct
  cell field width
  cell field height
constant rectangle

\ usage example:
\ here CREATE is used just for allocation
create r1 rectangle allot \ r1=HERE; HERE+=8
2 r1 width !
3 r1 height !
: area dup width @ swap height @ * ;
r1 area . \ should print 6

CELL is the size of a word; we could say "4 field width" instead of "cell field width" on 32b machines. Here's the definition of FIELD:

 : field ( struct-size field-size — new-struct-size )
    create over , +
    does> @ +
 ;

Again, pretty compact. The CREATE part stores the offset, a.k.a current struct size (OVER does ds.push(ds[1]), comma does *HERE++=ds.pop()), then adds the field size to the struct size, updating it for the next call to FIELD. The DOES> part fetches the offset, and adds it to the top of the stack, supposedly containing the object base pointer, so that "rect width" or "rect height" compute &rect.width or &rect.height, respectively. Then you can access this address with @ or ! (fetch/store). STRUCT simply pushes 0 to the top of the data stack (initial size value), and at the end, CONSTANT consumes the struct size:

struct \ data stack: 0
  cell ( ds: 0 4 ) field width  ( ds: 4 )
  cell ( ds: 4 4 ) field height ( ds: 8 )
constant rectangle ( ds: as before STRUCT )

You can further extend this to support polymorphic methods - METHOD would work similarly to FIELD, fetching a function pointer ("execution token") through a vtable pointer and an offset kept in the CREATEd part. A basic object system in Forth can thus be implemented in one screen (a Forth code size unit - 16 lines x 64 characters).

To this day, I find it shocking that you can define defining words like CONSTANT, FIELD, CLASS, METHOD - something reserved to built-in keywords and syntactic conventions in most languages - and you can do it so compactly using such crude facilities so trivial to implement. Back when I first saw this, I didn't know about DEFMACRO and how it could be used to implement the defining words of CLOS such as DEFCLASS and DEFMETHOD (another thing about Lisp they don't teach in schools). So Forth was completely mind-blowing.

And then I put Forth aside.

It seemed more suited for number crunching/"systems programming" than text processing/"scripting", whereas it is scripting that is the best trojan horse for pushing a language into an organization. Scripting is usually mission-critical without being acknowledged as such, and many scripts are small and standalone. Look how many popular "scripting languages" there are as opposed to "systems programming languages". Then normalize it by the amount of corporate backing a language got on its way to popularity. Clearly scripting is the best trojan horse.

In short, there were few opportunities to play with Forth at work, so I didn't. I fiddled with the interpreter and with the metaprogramming and then left it at that without doing any real programming.

Here's what Jeff Fox, a prominent member of the Forth community who've worked with Chuck Moore for years, has to say about people like me:

Forth seems to mean programming applications to some and porting Forth or dissecting Forth to others. And these groups don't seem to have much in common.

…One learns one set of things about frogs from studying them in their natural environment or by getting a doctorate in zoology and specializing in frogs. And people who spend an hour dissecting a dead frog in a pan of formaldehyde in a biology class learn something else about frogs.

…One of my favorite examples was that one notable colorforth [a Forth dialect] enthusiast who had spent years studying it, disassembling it, reassembling it and modifying it, and made a lot of public comments about it, but had never bothered running it and in two years of 'study' had not been able to figure out how to do something in colorforth as simple as:

1 dup +

…[such Forth users] seem to have little interest in what it does, how it is used, or what people using it do with it. But some spend years doing an autopsy on dead code that they don't even run.

Live frogs are just very different than dead frogs.

Ouch. Quite an assault not just on a fraction of a particular community, but on language geeks in general.

I guess I feel that I could say that if it isn't solving a significant real problem in the real world it isn't really Forth.

True, I guess, and equally true from the viewpoint of someone extensively using any non-mainstream language and claiming enormous productivity gains for experts. Especially true for the core (hard core?) of the Forth community, Forth being their only weapon. They actually live in Forth; it's DIY taken to the extreme, something probably unparalleled in the history of computing, except, perhaps, the case of Lisp environments and Lisp machines (again).

Code running on Forth chips. Chips designed with Forth CAD tools. Tools developed in a Forth environment running on the bare metal of the desktop machine. No standard OS, file system or editor. All in recent years when absolutely nobody else would attempt anything like it. They claim to be 10x to 100x more productive than C programmers (a generic pejorative term for non-Forth programmers; Jeff Fox is careful to put "C" in quotes, presumably either to make the term more generic or more pejorative).

…people water down the Forth they do by not exercising most of the freedom it offers… by using Forth only as debugger or a yet another inefficient scripting language to be used 1% of the time.

Forth is about the freedom to change the language, the compiler, the OS or even the hardware design and is very different than programming languages that are about fitting things to a fixed language syntax in a narrow work context.

What can be said of this? If, in order to "really" enter a programming culture, I need to both "be solving a significant real problem in the real world" and exercising "the freedom to change the language, the compiler, the OS or even the hardware design", then there are very few options for entering this culture indeed. The requirement for "real world work" is almost by definition incompatible with "the freedom to change the language, the compiler, the OS and the hardware design".

And then it so happened that I started working on a real-world project about as close to Forth-level DIY as possible. It was our own hardware, with our own OS, our own compilers, designed to be running our own application. We did use standard CAD tools, desktop operating systems and editors, and had standard RISC cores in the chip and standard C++ cross compilers for them. Well, everyone has weaknesses. Still, the system was custom-tailored, embedded, optimized, standalone, with lots of freedom to exercise - pretty close to the Forth way, in one way.

One part of the system was an image processing co-processor, a variation on the VLIW theme. Its memory access and control flow was weird and limited, to the effect that you could neither access nor jump to an arbitrary memory address. It worked fine for the processing-intensive parts of our image processing programs.

We actually intended to glue those parts together with a few "control instructions" setting up the plentiful control registers of this machine. When I tried, it quickly turned out, as was to be expected, that those "control instructions" must be able to do, well, everything - arithmetic, conditions, loops. In short, we needed a CPU.

We thought about buying a CPU, but it was unclear how we could use an off-the-shelf product. We needed to dispatch VLIW instructions from the same instruction stream. We also needed a weird mixture of features. No caches, no interrupts, no need for more than 16 address bits, but for accessing 64 data bits, and 32-bit arithmetic.

We thought about making our own CPU. The person with the overall responsibility for the hardware design gently told me that I was out of my mind. CPUs have register files and pipeline and pipeline stalls and dependency detection to avoid those stalls and it's too complicated.

And then I asked, how about a stack machine? No register file. Just a 3-stage pipeline - fetch, decode, execute. No problem with register dependencies, always pop inputs from the top of the stack, push the result.

He said it sounded easy enough alright, we could do that. "It's just like my RPN calculator. How would you program it?" "In Forth!"

I defined the instruction set in a couple of hours. It mapped to Forth words as straightforwardly as possible, plus it had a few things Forth doesn't have that C might need, as a kind of insurance (say, access to 16-bit values in memory).

This got approved and implemented; not that it became the schedule bottleneck, but it was harder than we thought. Presumably that was partly the result of not reading "Stack Computers: the new wave", and not studying the chip designs of Forth's creator Chuck Moore, either. I have a feeling that knowledgable people would have sneered at this machine: it was trivial to compile Forth to it, but at the cost of complicating the hardware.

But I was satisfied - I got a general-purpose CPU for setting up my config regs at various times through my programs, and as a side effect, I got a Forth target. And even if it wasn't the most cost-effective Forth target imaginable, it was definitely a time to start using Forth at work.

(Another area of prior art on stack machines that I failed to study in depth was 4stack - an actual VLIW stack machine, with 4 data stacks as suggested by its name. I was very interested in it, especially during the time when we feared implementation problems with the multi-port register file feeding our multiple execution units. I didn't quite figure out how programs would map to 4stack and what the efficiency drop would be when one had to spill stuff from the data stacks to other memory because of data flow complications. So we just went for a standard register file and it worked out.)

The first thing I did was write a Forth cross-compiler for the machine - a 700-line C++ file (and for reasons unknown, the slowest-compiling C++ code that I have ever seen).

I left out all of the metaprogramming stuff. For instance, none of the Forth examples above, the ones that drove me to Forth, could be made to work in my own Forth. No WORD, no COMPILE, no IMMEDIATE, no CREATE/DOES>, no nothing. Just colon definitions, RPN syntax, flow control words built into the compiler. "Optimizations" - trivial constant folding so that 1 2 + becomes 3, and inlining - :INLINE 1 + ; works just like : 1 + ; but is inlined into the code of the caller. (I was working on the bottlenecks so saving a CALL and a RETURN was a big deal.) So I had that, plus inline assembly for the VLIW instructions. Pretty basic.

I figured I didn't need the more interesting metaprogramming stuff for my first prototype programs, and I could add it later if it turned out that I was wrong. It was wierd to throw away everything I originally liked the most, but I was all set to start writing real programs. Solving real problems in the real world.

It was among the most painful programming experiences in my life.

All kinds of attempts at libraries and small test programs aside, my biggest program was about 700 lines long (that's 1 line of compiler code for 1 line of application code). Here's a sample function:

: mean_std ( sum2 sum inv_len — mean std )
  \ precise_mean = sum * inv_len;
  tuck u* \ sum2 inv_len precise_mean
  \ mean = precise_mean >> FRAC;
  dup FRAC rshift -rot3 \ mean sum2 inv_len precise_mean
  \ var = (((unsigned long long)sum2 * inv_len) >> FRAC) - (precise_mean * precise_mean >> (FRAC*2));
  dup um* nip FRAC 2 * 32 - rshift -rot \ mean precise_mean^2 sum2 inv_len
  um* 32 FRAC - lshift swap FRAC rshift or \ mean precise_mean^2 sum*inv_len
  swap - isqrt \ mean std
;

Tuck u*.

This computes the mean and the standard deviation of a vector given the sum of its elements, the sum of their squares, and the inverse of its length. It uses scaled integer arithmetic: inv_len is an integer keeping (1<<FRAC)/length. How it arranges the data on the stack is beyond me. It was beyond me at the time when I wrote this function, as indicated by the plentiful comments documenting the stack state, amended by wimpy C-like comments ("C"-like comments) explaining the meaning of the postfix expressions.

This nip/tuck business in the code? Rather than a reference to the drama series on plastic surgery, these are names of Forth stack manipulation words. You can look them up in the standard. I forgot what they do, but it's, like, ds.insert(2,ds.top()), ds.remove(1), this kind of thing.

Good Forth programmers reportedly don't use much of those. Good Forth programmers arrange things so that they flow on the stack. Or they use local variables. My DRAW-RECTANGLE definition above, with a 2DUP, was reasonably flowing by my standards: you get width and height, duplicate both, and have all 4 data items - width,height,width,height - consumed by the next 4 words. Compact, efficient - little stack manipulation. Alternatively we could write:

: DRAW-RECTANGLE { width height }
  height UP
  width RIGHT
  height DOWN
  width LEFT
;

Less compact, but very readable - not really, if you think about it, since nobody knows how much stuff UP leaves on the stack and what share of that stuff RIGHT consumes, but readable enough if you assume the obvious. One reason not to use locals is that Chuck Moore hates them:

I remain adamant that local variables are not only useless, they are harmful.

If you are writing code that needs them you are writing non-optimal code. Don't use local variables. Don't come up with new syntax for describing them and new schemes for implementing them. You can make local variables very efficient especially if you have local registers to store them in, but don't. It's bad. It's wrong.

It is necessary to have [global] variables. … I don't see any use for [local] variables which are accessed instantaneously.

Another reason not to use locals is that it takes time to store and fetch them. If you have two items on a data stack on a hardware stack machine, + will add them in one cycle. If you use a local, then it will take a cycle to store its value with { local_name }, and a cycle to fetch its value every time you mention local_name. On the first version of our machine, it was worse as fetching took 2 cycles. So when I wrote my Forth code, I had to make it "flow" for it to be fast.

The abundance of DUP, SWAP, -ROT and -ROT3 in my code shows that making it flow wasn't very easy. One problem is that every stack manipulation instruction also costs a cycle, so I started wondering whether I was already past the point where I had a net gain. The other problem was that I couldn't quite follow this flow.

Another feature of good Forth code, which supposedly helps achieve the first good feature ("flow" on the stack), is factoring. Many small definitions.

Forth is highly factored code. I don't know anything else to say except that Forth is definitions. If you have a lot of small definitions you are writing Forth. In order to write a lot of small definitions you have to have a stack.

In order to have really small definitions, you do need a stack, I guess - or some other implicit way of passing parameters around; if you do that explicitly, definitions get bigger, right? That's how you can get somewhat Forth-y with Perl - passing things through the implicit variable $_: call chop without arguments, and you will have chopped $_.

Anyway, I tried many small definitions:

:inline num_rects params @ ;
:inline sum  3 lshift gray_sums + ;
:inline sum2 3 lshift gray_sums 4 + + ;
:inline rect_offset 4 lshift ;
:inline inv_area rect_offset rects 8 + + @ ;
:inline mean_std_stat ( lo hi — stat )
  FRAC lshift swap 32 FRAC - rshift or
;
: mean_std_loop
 \ inv_global_std = (1LL << 32) / MAX(global_std, 1);
 dup 1 max 1 swap u/mod-fx32 drop \ 32 frac bits

 num_rects \ start countdown
 begin
  1 - \ rects–
  dup sum2 @
  over sum @
  pick2 inv_area
  mean_std \ global_mean global_std inv_global_std rectind mean std
  rot dup { rectind } 2 NUM_STATS * * stats_arr OFT 2 * + + { stats }
  \ stats[OFT+0] = (short)( ((mean - global_mean) * inv_global_std) >> (32 - FRAC) );
  \ stats[OFT+1] = (short)( std * inv_global_std >> (32 - FRAC) );
  pick2       um* mean_std_stat stats 2 + h! \ global_mean global_std inv_global_std mean
  pick3 - over m* mean_std_stat stats h!
  rectind ?dup 0 = \ quit at rect 0
 until
 drop 2drop
;

I had a bunch of those short definitions, and yet I couldn't get rid of heavy functions with DUP and OVER and PICK and "C" comments to make any sense of it. This stack business just wasn't for me.

Stacks are not popular. It's strange to me that they are not. There is just a lot of pressure from vested interests that don't like stacks, they like registers.

But I actually had a vested interest in stacks, and I began to like registers more and more. The thing is, expression trees map perfectly to stacks: (a+b)*(c-d) becomes a b + c d - *. Expression graphs, however, start to get messy: (a+b)*a becomes a dup b + *, and this dup cluttering things up is a moderate example. And an "expression graph" simply means that you use something more than once. How come this clutters up my code? This is reuse. A kind of factoring, if you like. Isn't factoring good?

In fact, now that I thought of it, I didn't understand why stacks were so popular. Vested interests, perhaps? Why is the JVM bytecode and the .NET bytecode and even CPython's bytecode all target stack VMs? Why not use registers the way LLVM does?

Speaking of which. I started to miss a C compiler. I downloaded LLVM. (7000 files plus a huge precompiled gcc binary. 1 hour to build from scratch. So?) I wrote a working back-end for the stack machine within a week. Generating horrible code. Someone else wrote an optimizing back-end in about two months.

After a while, the optimizing back-end's code wasn't any worse than my hand-coded Forth. Its job was somewhat easier than mine since by the time it arrived, it only took 1 cycle to load a local. On the other hand, loads were fast as long as they weren't interleaved with stores - some pipeline thing. So the back-end was careful to reorder things so that huge sequences of loads went first and then huge sequences of stores. Would be a pity to have to do that manually in Forth.

You have no idea how much fun it is to just splatter named variables all over the place, use them in expressions in whatever order you want, and have the compiler schedule things. Although you do it all day. And that was pretty much the end of Forth on that machine; we wrote everything in C.

What does this say about Forth? Not much except that it isn't for me. Take Prolog. I know few things more insulting than having to code in Prolog. Whereas Armstrong developed Erlang in Prolog and liked it much better than reimplementing Erlang in C for speed. I can't imagine how this could be, but this is how it was. People are different.

Would a good Forth programmer do better than me? Yes, but not just at the level of writing the code differently. Rather, at the level of doing everything differently. Remember the freedom quote? "Forth is about the freedom to change the language, the compiler, the OS or even the hardware design".

…And the freedom to change the problem.

Those computations I was doing? In Forth, they wouldn't just write it differently. They wouldn't implement them at all. In fact, we didn't implement them after all, either. The algorithms which made it into production code were very different - in our case, more complicated. In the Forth case, they would have been less complicated. Much less.

Would less complicated algorithms work? I don't know. Probably. Depends on the problem though. Depends on how you define "work", too.

The tiny VLSI toolchain from the epigraph? I showed Chuck Moore's description of that to an ASIC hacker. He said it was very simplistic - no way you could do with that what people are doing with standard tools.

But Chuck Moore isn't doing that, under the assumption that you need not to. Look at the chips he's making. 144-core, but the cores (nodes) are tiny - why would you want them big, if you feel that you can do anything with almost no resources? And they use 18-bit words. Presumably under the assumption that 18 bits is a good quantity, not too small, not too large. Then they write an application note about imlpementing the MD5 hash function:

MD5 presents a few problems for programming a Green Arrays device. For one thing it depends on modulo 32 bit addition and rotation. Green Arrays chips deal in 18 bit quantities. For another, MD5 is complicated enough that neither the code nor the set of constants required to implement the algorithm will fit into one or even two or three nodes of a Green Arrays computer.

Then they solve these problems by manually implementing 32b addition and splitting the code across nodes. But if MD5 weren't a standard, you could implement your own hash function without going to all this trouble.

In his chip design tools, Chuck Moore naturally did not use the standard equations:

Chuck showed me the equations he was using for transistor models in OKAD and compared them to the SPICE equations that required solving several differential equations. He also showed how he scaled the values to simplify the calculation. It is pretty obvious that he has sped up the inner loop a hundred times by simplifying the calculation. He adds that his calculation is not only faster but more accurate than the standard SPICE equation. … He said, "I originally chose mV for internal units. But using 6400 mV = 4096 units replaces a divide with a shift and requires only 2 multiplies per transistor. … Even the multiplies are optimized to only step through as many bits of precision as needed.

This is Forth. Seriously. Forth is not the language. Forth the language captures nothing, it's a moving target. Chuck Moore constantly tweaks the language and largely dismisses the ANS standard as rooted in the past and bloated. Forth is the approach to engineering aiming to produce as small, simple and optimal system as possible, by shaving off as many requirements of every imaginable kind as you can.

That's why its metaprogramming is so amazingly compact. It's similar to Lisp's metaprogramming in much the same way bacterial genetic code is similar to that of humans - both reproduce. Humans also do many other things that bacteria can't (…No compatibility. No files. No operating system). And have a ton of useless junk in their DNA, their bodies and their habitat.

Bacteria have no junk in their DNA. Junk slows down the copying of the DNA which creates a reproduction bottleneck so junk mutations can't compete. If it can be eliminated, it should. Bacteria are small, simple, optimal systems, with as many requirements shaved off as possible. They won't conquer space, but they'll survive a nuclear war.

This stack business? Just a tiny aspect of the matter. You have complicated expression graphs? Why do you have complicated expression graphs? The reason Forth the language doesn't have variables is because you can eliminate them, therefore they are junk, therefore you should eliminate them. What about those expressions in your Forth program? Junk, most likely. Delete!

I can't do that.

I can't face people and tell them that they have to use 18b words. In fact I take pride in the support for all the data types people are used to from C in our VLIW machine. You can add signed bytes, and unsigned shorts, and you even have instructions adding bytes to shorts. Why? Do I believe that people actually need all those combinations? Do I believe that they can't force their 16b unsigned shorts into 15b signed shorts to save hardware the trouble?

OF COURSE NOT.

They just don't want to. They want their 16 bits. They whine about their 16th bit. Why do they want 16 and not 18? Because they grew up on C. "C". It's completely ridiculous, but nevertheless, people are like that. And I'm not going to fight that, because I am not responsible for algorithms, other people are, and I want them happy, at least to a reasonable extent, and if they can be made happier at a reasonable cost, I gladly pay it. (I'm not saying you can't market a machine with a limited data type support, just using this as an example of the kind of junk I'm willing to carry that in Forth it is not recommended to carry.)

Why pay this cost? Because I don't do algorithms, other people do, so I have to trust them and respect their judgment to a large extent. Because you need superhuman abilities to work without layers. My minimal stack of layers is - problem, software, hardware. People working on the problem (algorithms, UI, whatever) can't do software, not really. People doing software can't do hardware, not really. And people doing hardware can't do software, etc.

The Forth way of focusing on just the problem you need to solve seems to more or less require that the same person or a very tightly united group focus on all three of these things, and pick the right algorithms, the right computer architecture, the right language, the right word size, etc. I don't know how to make this work.

My experience is, you try to compress the 3 absolutely necessary layers to 2, you get a disaster. Have your algorithms people talk directly to your hardware people, without going through software people, and you'll get a disaster. Because neither understands software very well, and you'll end up with an unusable machine. Something with elaborate computational capabilities that can't be put together into anything meaningful. Because gluing it together, dispatching, that's the software part.

So you need at least 3 teams, or people, or hats, that are to an extent ignorant about each other's work. Even if you're doing everything in-house, which, according to Jeff Fox, was essentially a precondition to "doing Forth". So there's another precondtion - having people being able to do what at least 3 people in their respective areas normally do, and concentrating on those 3 things at the same time. Doing the cross-layer global optimization.

It's not how I work. I don't have the brain nor the knowledge nor the love for prolonged meditation. I compensate with, um, people skills. I find out what people need, that is, what they think they need, and I negotiate, and I find reasonable compromises, and I include my narrow understanding of my part - software tools and such - into those compromises. This drags junk in. I live with that.

I wish I knew what to tell you that would lead you to write good Forth. I can demonstrate. I have demonstrated in the past, ad nauseam, applications where I can reduce the amount of code by 90% and in some cases 99%. It can be done, but in a case by case basis. The general principle still eludes me.

And I think he can, especially when compatibility isn't a must. But not me.

I still find Forth amazing, and I'd gladly hack on it upon any opportunity. It still gives you the most bang for the buck - it implements the most functionality in the least space. So still a great fit for tiny targets, and unlikely to be surpassed. Both because it's optimized so well and because the times when only bacteria survived in the amounts of RAM available are largely gone so there's little competition.

As to using Forth as a source of ideas on programming and language design in general - not me. I find that those ideas grow out of an approach to problem solving that I could never apply.

Consistency: how to defeat the purpose of IEEE floating point

I don't know much about the design of IEEE floating point, except for the fact that a lot of knowledge and what they call "intellectual effort" went into it. I don't even know the requirements, and I suspect those were pretty detailed and complex (for example, the benefits of having a separate representation for +0 and -0 seem hard to grasp unless you know about the very specific and hairy examples in the complex plane). So I don't trust my own summary of the requirements very much. That said, here's the summary: the basic purpose of IEEE floating point is to give you results of the highest practically possible precision at each step of your computation.

I'm not going to claim this requirement is misguided, because I don't feel like arguing with people two orders of magnitude more competent than myself who have likely faced much tougher numerical problems than I've ever seen. What I will claim is that differences in numerical needs divide programmers into roughly three camps, and the highest-possible-precision approach hurts one of them really badly, and so has to be worked around in ways we'll discuss. The camps are:

  1. The huge camp of people who do businessy accounting. Those should work with integral types to get complete, deterministic, portable control over rounding and all that. Many of the clueless people in this camp represent 1 dollar and 10 cents as the floating point number 1.1. While they are likely a major driving force behind economical growth, I still think they deserve all the trouble they bring upon themselves.
  2. The tiny camp doing high-end scientific computing. Those are the people who can really appreciate the design of IEEE floating point and use its full power. It's great that humanity accidentally satisfied the needs of this small but really cool group, making great floating point hardware available everywhere through blind market forces. It's like having a built-in Stradivari in each home appliance. Yes, perhaps I exaggerate; I get that a lot.
  3. The sizable camp that deals with low-end to mid-range semi-scientific computing. You know, programs that have some geometry or physics or algebra in them. 99.99% of the code snippets in that realm work great with 64b floating point, without the author having invested any thought at all into "numerical analysis". 99% of the code works with 32b floats. When someone stumbles upon a piece of code in the 1% and actually witnesses fatal precision loss, everybody gathers to have a look as if they heard about a beautiful rainbow or a smoke suggesting a forest fire near the horizon.

The majority of people who use and actually should be using floating point are thus in camp 3. Those people don't care about precision anywhere near camp 2, nor do they know how to make the best of IEEE floating point in the very unlikely circumstances where their naive approach will actually fail to work. What they do care about though is consistency. It's important that things compute the same on all platforms. Perhaps more importantly for most, they should compute the same under different build settings, most notably debug and release mode, because otherwise you can't reproduce problems.

Side note: I don't believe in build modes; I usually debug production code in release mode. It's not just floating point that's inconsistent across modes - it's code snippets with behavior undefined by the language, buggy dependence on timing, optimizer bugs, conditional compilation, etc. Many other cans of worms. But the fact is that most people have trouble debugging optimized code, and nobody likes it, so it's nice to have the option to debug in debug mode, and to do that, you need things to reproduce there.

Also, comparing results of different build modes is one way to find worms from those other cans, like undefined behavior and optimizer bugs. Also, many changes you make are optimizations or refaptorings and you can check their sanity by making sure they didn't change the results of the previous version. As we'll see, IEEE FP won't give you even that, regardless of platforms and build modes. The bottom line is that if you're in camp 3, you want consistency, while the "only" things you can expect from IEEE FP is precision and speed. Sure, "only" should be put in quotes because it's a lot to get, it's just a real pity that fulfilling the smaller and more popular wish for consistency is somewhere between hard and impossible.

Some numerical analysts seem annoyed by the camp 3 whiners. To them I say: look, if IEEE FP wasn't the huge success that it is in the precision and speed departments, you wouldn't be hearing from us because we'd be busy struggling with those problems. What we're saying is the exact opposite of "IEEE FP sucks". It's "IEEE FP is so damn precise and fast that I'm happy with ALL of its many answers - the one in optimized x86 build, the one in debug PowerPC build, the one before I added a couple of local variables to that function and the one I got after that change. I just wish I consistently got ONE of these answers, any of them, but the same one." I think it's more flattering than insulting.

I've accumulated quite some experience in defeating the purpose of IEEE floating point and getting consistency at the (tiny, IMO) cost of precision and speed. I want to share this knowledge with humanity, with the hope of getting rewarded in the comments. The reward I'm after is a refutation of my current theory that you can only eliminate 95%-99% of the pain automatically and have to solve the rest manually each time it raises its ugly head.

The pain breakdown

I know three main sources of floating point inconsistency pain:

  1. Algebraic compiler optimizations
  2. "Complex" instructions like multiply-accumulate or sine
  3. x86-specific pain not available on any other platform; not that ~100% of non-embedded devices is a small market share for a pain.

The good news is that most pain comes from item 3 which can be more or less solved automatically. For the purpose of decision making ("should we invest energy into FP consistency or is it futile?"), I'd say that it's not futile and if you can cite actual benefits you'd get from consistency, than it's worth the (continuous) effort.

Disclaimer: I only discuss problems I know and to the extent of my understanding. I have no solid evidence that this understanding is complete. Perhaps the pain breakdown list should have item 4, and perhaps items 1 to 3 have more meat than I think. And while I tried to get the legal stuff right - which behavior conforms to IEEE 754, which conforms to C99, and which conforms to nothing but is still out there - I'm generally a weak tech lawyer and can be wrong. I can only give you the "worked on my 4 families of machines" kind of warranty.

Algebraic compiler optimizations

Compilers, or more specifically buggy optimization passes, assume that floating point numbers can be treated as a field - you know, associativity, distributivity, the works. This means that a+b+c can be computed in either the order implied by (a+b)+c or the one implied by a+(b+c). Adding actual parentheses in source code doesn't help you one bit. The compiler assumes associativity and may implement the computation in the order implied by regrouping your parentheses. Since each floating point operation loses precision on some of the possible inputs, the code generated by different optimizers or under different optimization settings may produce different results.

This could be extremely intimidating because you can't trust any FP expression with more than 2 inputs. However, I think that programming languages in general don't allow optimizers to make these assumptions, and in particular, the C standard doesn't (C99 §5.1.2.3 #13, didn't read it in the document but saw it cited in two sources). So this sort of optimization is actually a bug that will likely be fixed if reported, and at any given time, the number of these bugs in a particular compiler should be small.

I only recall one recurring algebraic optimization problem. Specifically, a*(b+1) tended to be compiled to a*b+a in release mode. The reason is that floating-point literal values like 1 are expensive; 1 becomes a hairy hexadecimal value that has to be loaded from a constant address at run time. So the optimizer was probably happy to optimize a literal away. I was always able to solve this problem by changing the spelling in the source code to a*b+a - the optimizer simply had less work to do, while the debug build saw no reason to make me miserable by undoing my regrouping back into a*(b+1).

This implies a general way of solving this sort of problem: find what the optimizer does by looking at the generated assembly, and do it yourself in the source code. This almost certainly guarantees that debug and release will work the same. With different compilers and platforms, the guarantee is less certain. The second optimizer may think that the optimization you copied from the first optimizer into your source code is brain-dead, and undo it and do a different optimization. However, that means you target two radically different optimizers, both of which are buggy and can't be fixed in the near future; how unlucky can you get?

The bottom line is that you rarely have to deal with this problem, and when it can't be solved with a bug report, you can look at the assembly and do the optimization in the source code yourself. If that fails because you have to use two very different and buggy compilers, use the shotgun described in the next item.

"Complex" instructions

Your target hardware can have instructions computing "non-trivial" expressions beyond a*b or a+b, such as a+=b*c or sin(x). The precision of the intermediate result b*c in a+=b*c may be higher than the size of an FP register would allow, had that result been actually stored in a register. IEEE and the C standard think it's great, because the single instruction generated from a+=b*c is both faster and more precise than the 2 instructions implementing it as d=b*c, a=a+d. Camp 3 people like myself don't think it's so great, because it happens in some build modes but not others, and across platforms the availability of these instruction varies, as does their precision.

AFAIK the "contraction" of a+=b*c is permitted by both the IEEE FP standard (which defines FP + and *) and the C standard (which defines FP types that can map to standards other than IEEE). On the other hand, sin(x), which also gets implemented in hardware these days, isn't addressed by either standard - to the same effect of making the optimization perfectly legitimate. So you can't solve this by reporting a bug the way you could with algebraic optimizations. The other way in which this is tougher is that tweaking the code according to the optimizer's wishes doesn't help much. AFAIK what can help is one of these two things:

  1. Ask the compiler to not generate these instructions. Sometimes there's an exact compiler option for that, like gcc's platform-specific -mno-fused-madd flag, or there's (a defined and actually implemented) pragma directive such as #pragma STDC FP_CONTRACT. Sometimes you don't have any of that, but you can lie to the compiler that you're using an older (compatible) revision of the processor architecture without the "complex" instructions. The latter is an all-or-nothing thing enabling/disabling lots of stuff, so it can degrade performance in many different ways; you have to check the cost.
  2. If compiler flags can't help, there's the shotgun approach I promised to discuss above, also useful for hypothetical cases of hard-to-work-around algebraic optimizations. Instead of helping the optimizer, we get in its way and make optimization impossible using separate compilation. For example, we can convert a+=b*c to a+=multiply_dammit(b,c); multiply_dammit is defined in a separate file. This makes it impossible for most optimizers to see the expression a+=b*c, and forces them to implement multiplication and addition separately. Modern compilers support link-time inlining and then they do optimize the result as a whole. But you can disable that option, and as a side effect speed up linkage a great deal; if that seriously hurts performance, your program is insane and you're a team of scary ravioli coders.

The trouble with the shotgun approach, aside from its ugliness, is that you can't afford to shoot at the performance-critical parts of your code that way. Let us hope that you'll never really have to choose between FP consistency and performance, as I've never had to date.

x86

Intel is the birthplace of IEEE floating point, and the manufacturer of the most camp-3-painful and otherwise convoluted FP hardware. The pain comes, somewhat understandably, from a unique commitment to the IEEE FP philosophy - intermediate results should be as precise as possible; more on that in a moment. The "convoluted" part is consistent with the general insanity of the x86 instruction set. Specifically, the "old" (a.k.a "x87") floating point unit uses a stack architecture for addressing FP operands, which is pretty much the exact opposite of the compiler writer's dream target, but so is the rest of x86. The "new" floating point instructions in SSE don't have these problems, at the cost of creating the aesthetic/psychiatric problem of actually having two FP ISAs in the same processor.

Now, in our context we don't care about the FP stack thingie and all that, the only thing that matters is the consistency of precision. The "old" FP unit handles precision thusly. Precision of stuff written to memory is according to the number of bits of the variable, 'cause what else can it be. Precision of intermediate results in the "registers" (or the "FP stack" or whatever you call it) is defined according to the FPU control & status register, globally for all intermediate results in your program.

By default, it's 80 bits. This means that when you compute a*b+c*d and a,b,c,d are 32b floats, a*b and c*d are computed in 80b precision, and then their 80b sum is converted to a 32b result in memory (if a*b+c*d is indeed written to memory and isn't itself an "intermediate" result). Indeed, what's "intermediate" in the sense of not being written to memory and what isn't? That depends on:

  1. Debug/release build. If we have "float e=a*b+c*d", and e is only used once right in the next line, the optimizer probably won't see a point in writing it to memory. However, in a debug build there's a good reason to allocate it in memory, because if you single-step your program and you're already past the line that used e, you still might want to look at the value of e, so it's good that the compiler kept a copy of it for the debugger to find.
  2. The code "near" e=a*b+c*d according to the compiler's notion of proximity also affects its decisions. There are only so many registers, and sometimes you run out of them and have to store things in memory. This means that if you add or remove code near the line or in inline functions called near the line, the allocation of intermediate results may change.

Compilers could have an option asking them to hide this mess and give us consistent results. The problems with this are that (1) if you care about cross-platform/compiler consistency, then the availability of cross-mode consistency options in one compiler doesn't help with the other compiler and (2) for some reason, compilers apparently don't offer this option in a very good way. For example, MS C++ used to have a /fltconsistency switch but seems to have abandoned it in favor of an insane special-casing of the syntax float(a*b)+float(c*d) - that spelling forces consistency (although the C++ standard doesn't assign it a special meaning not included in the plain and sane a*b+c*d).

I'd guess they changed it because of the speed penalty it implies rather than the precision penalty as they say. I haven't heard about someone caring both about consistency and that level of precision, but I did hear that gcc's consistency-forcing -ffloat-store flag caused notable slowdowns. And the reason it did is implied by its name - AFAIK the only way to implement x86 FP consistency at compile time is to generate code storing FP values to memory to get rid of the extra precision bits. And -ffloat-store only affects named variables, not unnamed intermediate results (annoying, isn't it?), so /fltconsistency, assuming it actually gave you consistency of all results, should have been much slower. Anyway, the bottom line seems to be that you can't get much help from compilers here; deal with it yourself. Even Java gave up on its initial intent of getting consistent results on the x87 FPU and retreated to a cowardly strictfp scheme.

And the thing is, you never have to deal with it outside of x86 - all floating point units I've come across, including the ones specified by Intel's SSE and SSE2, simply compute 32b results from 32b inputs. People who decided to do it that way and rob us of quite some bits of precision have my deepest gratitude, because there's absolutely no good way to work around the generosity of the original x86 FPU designers and get consistent results. Here's what you can do:

  1. Leave the FP CSR configured to 80b precision. 32b and 64b intermediate results aren't really 32b and 64b. The fact that it's the default means that if you care about FP result consistency, intensive hair pulling during your first debugging sessions is an almost inevitable rite of passage.
  2. Set the FP CSR to 64b precision. If you only use 64b variables, debug==release and you're all set. If you have 32b floats though, then intermediate 32b results aren't really 32b. And usually you do have 32b floats.
  3. Set the FP CSR to 32b precision. debug==release, but you're far from "all set" because now your 64b results, intermediate or otherwise, are really 32b. Not only is this a stupid waste of bits, it is not unlikely to fail, too, because 32b isn't sufficient even for some of the problems encountered by camp 3. And of course it's not compatible with other platforms.
  4. Set the FP CSR to 64b precision during most of the program run, and temporarily set it to 32b in the parts of the program using 32b floats. I wouldn't go down that path; option 4 isn't really an option. I doubt that you use both 32b and 64b variables in a very thoughtful way and manage to have a clear boundary between them. If you depend on the ability of everyone to correctly maintain the CSR, then it sucks sucks sucks.

Side note: I sure as hell don't believe in "very special" "testing" build/running modes. For example, you could say that you have a special mode where you use option (3) and get 32b results, and use that mode to test debug==release or something. I think it's completely self-defeating, because the point of consistency is being able to reproduce a phenomenon X that happens in a mode which is actually important, in another mode where reproducing X is actually useful. Therefore, who needs consistency across inherently useless modes? We'd be defeating the purpose of defeating the purpose of IEEE floating point.

Therefore, if you don't have SSE, the only option is (2) - set the FP CSR to 64b and try to avoid 32b floats. On Linux, you can do it with:

#include <fpu_control.h>
fpu_control_t cw;
_FPU_GETCW(cw);
cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
_FPU_SETCW(cw);

Do it first thing in main(). If you use C++, you should do it first thing before main(), because people can use FP in constructors of global variables. This can be achieved by figuring out the compiler-specific translation unit initialization order, compiling your own C/C++ start-up library, overriding the entry point of a compiled start-up library using stuff like LD_PRELOAD, overwriting it in a statically linked program right there in the binary image, having a coding convention forcing to call FloatingPointSingleton::instance() before using FP, or shooting the people who like to do things before main(). It's a trade-off.

The situation is really even worse because the FPU CSR setting only affects mantissa precision but not the exponent range, so you never work with "real" 64b or 32b floats there. This matters in cases of huge numbers (overflow) and tiny numbers (double rounding of subnormals). But it's bad enough already, and camp 3 people don't really care about the extra horror; if you want those Halloween stories, you can find them here. The good news are that today, you are quite likely to have SSE2 and very likely to have SSE on your machine. So you can automatically sanitize all the mess as follows:

  1. If you have SSE2, use it and live happily ever after. SSE2 supports both 32b and 64b operations and the intermediate results are of the size of the operands. BTW, mixed expressions like a+b where a is float and b is double don't create consistency problems on any platform because the C standard specifies the rules for promotion precisely and portably (a will be promoted to double). The gcc way of using SSE2 for FP is -mfpmath=sse -msse2.
  2. If you only have SSE, use it for 32b floats which it does support (gcc: -mfpmath=sse -msse). 64b floats will go to the old FP unit, so you'll have to configure it to 64b intermediate results. Everything will work, the only annoying things being (1) the retained necessity to shoot the people having fun before main and (2) the slight differences in the semantics of control flags in the old FP and the SSE FP CSR, so if you configure your own policy, floats and doubles will not behave exactly the same. Neither problem is a very big deal.

Interestingly, SSE with its support for SIMD FP commands actually can make things worse in the standard-violating-algebraic-optimizations department. Specifically, Intel's compiler reportedly has (had?) an optimization which unrolls FP accumulation loops and reorders additions in order to utilize SIMD FP commands (gcc 4 does that, too, but only if you explicitly ask for trouble with -funsafe-math-optimizations or similar). But I wouldn't conclude anything from it, except that automatic vectorization, which is known to work only on the simplest of code snippets, actually doesn't work even on them.

Summary: use SSE2 or SSE, and if you can't, configure the FP CSR to use 64b intermediates and avoid 32b floats. Even the latter solution works passably in practice, as long as everybody is aware of it.

I think I covered everything I know except for things like long double, FP exceptions, etc. - and if you need that, you're not in camp 3; go away and hang out with your ivory tower numerical analyst friends. If you know a way to automate away more pain, I'll be grateful for every FP inconsistency debugging afternoon your advice will save me.

Happy Halloween!

Optimal processor size

I'm going to argue that high-performance chip designs ought to use a (relatively) modest number of (relatively) strong cores. This might seem obvious. However, enough money is spent on developing the other kinds of chips to make the topic interesting, at least to me.

I must say that I understand everyone throwing millions of dollars at hardware which isn't your classic multi-core design. I have an intimate relationship with multi-core chips, and we definitely hate each other. I think that multi-core chips are inherently the least productive programming environment available. Here's why.

Our contestants are:

  • single-box, single-core
  • single-box, multi-core
  • multi-box

With just one core, you aren't going to parallelize the execution of anything in your app just to gain performance, because you won't gain any. You'll only run things in parallel if there's a functional need to do so. So you won't have that much parallelism. Which is good, because you won't have synchronization problems.

If you're performance-hungry enough to need many boxes, you're of course going to parallelize everything, but you'll have to solve your synchronization problems explicitly and gracefully, because you don't have shared memory. There's no way to have a bug where an object happens to be accessed from two places in parallel without synchronization. You can only play with data that you've computed yourself, or that someone else decided to send you.

If you need performance, but for some reason can't afford multiple boxes (you run on someone's desktop or an embedded device), you'll have to settle for multiple cores. Quite likely, you're going to try to squeeze every cycle out of the dreaded device you have to live with just because you couldn't afford more processing power. This means that you can't afford message passing or a side-effect-free environment, and you'll have to use shared memory.

I'm not sure about there being an inherent performance impact to message passing or to having no side effects. If I try to imagine a large system with massive data structures implemented without side effects, it looks like you have to create copies of objects at the logical level. Of course, these copies can then be optimized out by the implementation; I just think that some of the copies will in fact be implemented straight-forwardly in practice.

I could be wrong, and would be truly happy if someone explained to me why. I mean, having no side effects helps analyze the data flow, but the language is still Turing-complete, so you don't always know when an object is no longer needed, right? So sometimes you have to make a new object and keep the old copy around, just in case someone needs it, right? What's wrong with this reasoning? Anyway, today I'll assume that you're forced to use mutable shared memory in multi-core systems for performance reasons, and leave this no-side-effects business for now.

Summary: multiple cores is for performance-hungry people without a budget for buying computational power. So they end up with awful synchronization problems due to shared memory mismanagement, which is even uglier than normal memory mismanagement, like leaks or dangling references.

Memory mismanagement kills productivity. Maybe you disagree; I won't try to convince you now, because, as you might have noticed, I'm desperately trying to stay on topic here. And the topic was that multi-core is an awful environment, so it's natural for people to try to develop a better alternative.

Since multi-core chips are built for anal-retentive performance weenies without a budget, the alternative should also be a high-performance, cheap system. Since the clock frequency doesn't double as happily as it used to these days, the performance must come from parallelism of some sort. However, we want to remove the part where we have independent threads accessing shared memory. What we can do is two things:

  • Use one huge processor.
  • Use many tiny processors.

What does processor "size" have to do with anything? There are basically two ways of avoiding synchronization problems. The problems come from many processors accessing shared memory. The huge processor option doesn't have many processors; the tiny processors option doesn't have shared memory.

The huge processor would run one thread of instructions. To compensate for having just one processor, each instruction would process a huge amount of data, providing the parallelism. Basically we'd have a SIMD VLIW array, except it would be much much wider/deeper than stuff like AltiVec, SSE or C6000.

The tiny processors would talk to their neighbor tiny processors using tiny FIFOs or some other kind of message passing. We use FIFOs to eliminate shared memory. We make the processors tiny because large processors are worthless if they can't process large amounts of data, and large amounts of data mean lots of bandwidth, and lots of bandwidth means memory, and we don't want memory. The advantage over the SIMD VLIW monster is that you run many different threads, which gives more flexibility.

So it's either huge or tiny processors. I'm not going to name any particular architecture, but there were and still are companies working on such things, both start-ups and seasoned vendors. What I'm claiming is that these options provide less performance per square millimeter compared to a multi-core chip. So they can't beat multi-core in the anal-retentive performance-hungry market. Multiple cores and the related memory mismanagement problems are here to stay.

What I'm basically saying is, for every kind of workload, there exists an optimal processor size. (Which finally gets me to the point of this whole thing.) If you put too much stuff into one processor, you won't really be able to use that stuff. If you don't put enough into it, you don't justify the overhead of creating a processor in the first place.

When I think about it, there seems to be no way to define a "processor" in a "universal" way; a processor could be anything, really. Being the die-hard von-Neumann machine devotee that I am, I define a processor as follows:

  • It reads, decodes and executes an instruction stream (a "thread")
  • It reads and writes memory (internal and possibly external)

This definition ignores at least two interesting things: that the human brain doesn't work that way, and that you can have hyper-threaded processors. I'll ignore both for now, although I might come back to the second thing some time.

Now, you can play with the "size" of the processor - its instructions can process tiny or huge amounts of data; the local memory/cache size can also vary. However, having an instruction processing kilobytes of data only pays off if you can normally give the processor that much data to multiply. Otherwise, it's just wasted hardware.

In a typical actually interesting app, there aren't that many places where you need to multiply a zillion adjacent numbers at the same cycle. Sure, your app does need to multiply a zillion numbers per second. But you can rarely arrange the computations in a way meeting the time and location constraints imposed by having just one thread.

I'd guess that people who care about, say, running a website back-end efficiently know exactly what I mean; their data is all intertwined and messy, so SIMD never works for them. However, people working on number crunching generally tend to underestimate the problem. The abstract mental model of their program is usually much more regular and simple in terms of control flow and data access patterns than the actual code.

For example, when you're doing white board run time estimations, you might think of lots of small pieces of data as one big one. It's not at all the same; if you try to convince a huge SIMD array that N small pieces of data are in fact one big one, you'll get the idea.

For many apps, and I won't say "most" because I've never counted, but for many apps, data parallelism can only get you that much performance; you'll need task parallelism to get the rest. "Task parallelism" is when you have many processors running many threads, doing unrelated things.

One instruction stream is not enough, unless your app is really (and not theoretically) very simple and regular. If you have one huge processor, most of it will remain idle most of the time, so you will have wasted space in your chip.

Having ruled out one huge processor, we're left with the other extreme - lots of tiny ones. I think that this can be shown to be inefficient in a relatively intuitive way.

Each time you add a "processor" to your system, you basically add overhead. Reading and decoding instructions and storing intermediate results to local memory is basically overhead. What you really want is to compute, and a processor necessarily includes quite some logic for dispatching computations.

What this means is that if you add a processor, you want it to have enough "meat" for it to be worth the trouble. Lots of tiny processors is like lots of managers, each managing too few employees. I think this argument is fairly intuitive, at least I find it easy to come up with a dumb real-world metaphor for it. The huge processor suffering from "lack of regularity" problems is a bit harder to treat this way.

Since a processor has an "optimal size", it also has an optimal level of performance. If you want more performance, the optimal way to get it is to add more processors of the same size. And there you have it - your standard, boring multi-core design.

Now, I bet this would be more interesting if I could give figures for this optimal size. I could of course shamelessly weasel out of this one - the optimal size depends on lots of things, for example:

  • Application domain. x86 runs Perl; C6000 runs FFT. So x86 has speculative execution, and C6000 has VLIW. (It turns out that I use the name "Perl" to refer to all code dealing with messy data structures, although Python, Firefox and Excel probably aren't any different. The reason must be that I think of "irregular" code in general and Perl in particular as a complicated phenomenon, and a necessary evil).
  • The cost of extra performance. Will your customer pay extra 80% for extra 20% of performance? For an x86-based system, the answer is more likely to be "yes" than for a C6000-based system. If the answer is "yes", adding hardware for optimizing rare use cases is a good idea.

I could go on and on. However, just for the fun of it, I decided to share my current magic constants with you. In fact there aren't many of them - I think that you can use at most 8-16 of everything. That is:

  • At most 8-16 vector elements for SIMD operations
  • At most 8-16 units working in parallel for VLIW machines
  • At most 8-16 processors per external memory module

Also, 8-16 of any of these is a lot. Many good systems will use, say, 2-4 because their application domain is more like Perl than FFT in some respect, so there's no chance of actually utilizing that many resources in parallel.

I have no evidence that the physical constants of the universe make my magic constants universally true. If you know about a great chip that breaks these "rules", it would be interesting to hear about it.

The Algorithmic Virtual Machine

There's a very influential platform called the AVM, which stands for Algorithmic Virtual Machine. That's the imaginary device people use as their mental model of a computer. In particular, it's used by many people working on algorithms where performance matters. Performance matters in many different contexts, ranging from huge clusters processing astronomic amounts of data to modest applications running on pathetically weak hardware. However, I believe that the core architecture of the AVM is basically the same everywhere.

AVM application development is done using the ubiquitous AVM SDK - a whiteboard and a couple of hands for handwaving. An AVM application consists of a set of operations your algorithm needs executed. Each operation has a cost (typically one cycle, sometimes more). You can then estimate the run time of your algorithm by the clever technique of summing the cost of all operations.

These estimations are never close enough to the real run time. The definition of "close enough" varies; the quality of estimations, by and large, doesn't. That is, I claim that your handwavy AVM-derived estimation will fail to meet your precision requirements no matter what those requirements are. Apparently our tolerance for errors grows with the lack of understanding of the problem, but it never grows enough. But I'm not really sure about this theory; I'm only sure about AVM-estimations-suck part. Here's why.

The AVM is basically this imaginary machine that runs "operations". Here are some things that real machines must do, but the AVM doesn't:

  • Fetching instructions
  • Fetching operands
  • Testing for conditions
  • Storing results

Basically, the Algorithmic Virtual Machine developers concentrate on "operations" and ignore addressing, branches, caches, buses, registers, pipelines, and all those other gadgets which are needed in order to dispatch the operation. In fact, that's how I currently distinguish between people who write software to get a job done and people who think of software as their job. "People who program" are into operations (algebra, networking, AI); "programmers" are into dispatching (programming languages, operating systems, OO). This is about mental focus rather than aptitude. I haven't noticed that people of either group are inherently less productive than the other kind.

When they're after performance, the "operations" people will naturally look for a way to reduce the number of operations. Sometimes, they'll find an algorithm with a better asymptotic complexity - O(N+M) instead of O(N*M). At other times, they'll come up with a way to perform 4*N*M operations instead of 16*N*M. Both results are very significant - if M and N are the only variables. The trouble is that you can't see all the variables if you just look at the math (as in "we want to multiply and sum all these and then compare to that"). That way, you assume that you run on the AVM and leave out all the dispatching-related variables and get the wrong answer.

Is there a way to take the cost of dispatching into account? Not really, not without implementing your algorithm and measuring its performance. However, families of machines do have related sets of heuristics that can be used to guess the cost of running on them. For example, here are a couple of heuristics that I use for SIMD machines (they are relevant elsewhere, but their relative importance may drop):

  1. Bandwidth is costly.
  2. Addressing is costly.

These heuristics are vague, and I don't see a very good way to make them formal. Perhaps there isn't any. To show that my points have any formal significance, I'd have to formally prove that there's unavoidable intrinsic cost to some things no matter how you build your hardware. And I don't know how to go about that. So what I'll do is I'll give some examples to show what I mean, and leave it there.

Bandwidth

Consider two "algorithms" (probably too fancy a name in this context): computing dot product, and computing its partial sums (Matlab: sum(a .* b) and cumsum(a .* b)). Exactly the same amount of "operations" - N multiplications and N additions. Many people with BA, MSc and PhD degrees in CS assume that the run time is going to be the same, too. It won't, because sum only produces one output, and cumsum produces N outputs. Worse, if the input vector elements are 8-bit integers, we probably need at least 32 bits for each output element. So we generate N*4 bytes of output from N*2 input bytes.

At this point, some people will say "Yeah, memory. Processors are fast, memories are slow, sure, memory is a problem". But it isn't just about the memory; memory bandwidth is just one kind of bandwidth. Let's look at the non-memory problems of the partial-sums-of-dot-product algorithm. On the way, I'll try to show how the "bandwidth costs" heuristic can be used to guess what your hardware can do and what the performance will be.

Consider a machine with a SIMD instruction set. Most likely, the machine has registers of fixed width (say, 16 bytes), and each instruction gets 2 inputs and produces 1 output. Why? Well, the hardware ought to support 2 inputs and 1 output to do basic math. Now, if it also wants to have an instruction that produces, say, 4 outputs, then it needs to have 3 additional output buses from the data processing units to the register file. It also needs a multiplexer so that each of the 4 outputs can be routed to each of its N registers (N can be 16 or 32 or even 128). The cost of multiplexers is, roughly, O(M*N), where M is the number of inputs and N is the number of outputs. That's awfully costly. Bandwidth costs. So they probably use 2 inputs and 1 output everywhere.

Now, suppose the machine has 16 multipliers, which is quite likely - 1 multiplier for each register byte, so we can multiply 16 pairs of bytes simultaneously. Does this mean that we can then take those 16 products and compute 16 new partial sums, all in the same cycle? Nope, because, among other things, we'd need a command producing 16×4 bytes to do that, and that's too much bandwidth. Are we likely to have a command that updates less than 16 accumulators? Yes, because that would speed up dot products, and dot products are very important; let's look at the manual.

You're likely to find a command updating - guess how many? - 4 accumulators (32 bits times 4 equals 16 bytes, that's exactly one machine register). If the register size is 8 bytes, you'll probably get a command updating 2 accumulators, and so on. Sometimes the machine uses "register pairs" for output; that doubles the register size for output bandwidth calculation purposes. The bottom line is that instruction set extensions can speed up dot product to an extent impossible for its partial sums. You might have noticed another problem here, that of the dependency of a partial sum on the previous partial sum. Removing this dependency doesn't solve the bandwidth problem. For example, consider the vertical projection of point-wise multiplication of 2 8-bit images, which has the same not-enough-accumulators problem.

There is little you can do about the bandwidth problem in the partial sums case - the algorithm is I/O bound. Some algorithms aren't, so you can optimize them to minimize the cost of bandwidth. For example, matrix multiplication is essentially lots of dot products. If you do those dot products straightforwardly, you'll have a loop spending 2 commands for loading the matrix elements into registers, and one command for multiplying and accumulating (MAC). 2 loads per MAC means an overhead of 200%.

However, you can work on blocks - 4 rows of matrix A and 4 columns of matrix B, and compute the 4×4=16 dot products in your loop. That's 4+4=8 loads per 16 MACs; the overhead dropped to 50%. If you have enough registers to do this. And it's still quite impressive overhead, isn't it? Your typical AVM user would be very disappointed. (Yes, some machines can parallelize the loads and the MACs, but some can't, and it's a toy example, and stop nitpicking). BTW, blocking can be used to save loads from main memory to cache just like we've used it to save loads from cache to registers.

OK. With partial sums of dot product, the bandwidth problem kills performance, and with matrix multiplication, it doesn't. What about convolution, which is about as basic as our previous examples? Gee, I really don't know. It's tricky, because with convolution, you need to store intermediate results somewhere, and it's unclear how many of them you're going to need. The optimal implementation depends on the quirks of the data processing units, the I/O, and the filter size. If you come across a benchmark showing the performance of convolution on some machine, you'll probably find interesting variations caused by the filter size.

So we have a bread-and-butter algorithm, and non-trivial & non-portable performance characteristics. I think it's one indication that your own less straightforward algorithm will also perform somewhat unpredictably. Unless you know an exact reason for the opposite.

Addressing

Bandwidth is one problem with fetching operands and storing results. Another problem is figuring out where they go. In the case of registers, we have costly multiplexers for selecting the source and destination registers of instructions. In the case of memory, we have addresses. Computing addresses has a cost. Reading data from those addresses also has a cost. Some address sequences are costlier than others from one of these perspectives, or both.

The dumbest example is the misalignment problem. People who learned C on x86 are sometimes annoyed when they meet a PowerPC or an ARM or almost any other processor since it won't read a 32-bit integer from a misaligned address. So when you read a binary buffer from a file or a socket, you can't just cast the char* to an int* and expect it to work. Isn't it nice of x86 to properly handle these cases?

Maybe it's nice, maybe it isn't (at least if it failed, the code would be fixed to become legal C), but it sure is costly. The fact that it's "in the hardware" doesn't make it a single-cycle operation. If your address is misaligned, the 32 bits may reside in two different memory words (no matter what the word size is). The hardware will have to read the low word, and then read the high word, and then take the high bits of the low word and the low bits of the high word and make a single 32-bit value out of them. Because in one cycle, memories can only fetch one word from an aligned address.

Does it matter outside of I/O-related code using illegal pointer-casting? Consider the prosaic algorithm of computing the first derivative of a vector, spelled v(2:end)-v(1:end-1) in Matlab. If we run on a SIMD machine, we could execute several subtractions simultaneously. In order to do that, we need to fetch a word containing v[0]…v[15] and a word containing v[1]…v[16] (both zero-based). But the second word is misaligned. The handling of misalignment will have a cost, whether it's done in hardware or in software.

Well, at least the operands of subtraction live in subsequent addresses - 0,1,2…15 and 1,2,3…16. That's how data processing units like them: you read a pack of numbers from memory and feed them right into the array of adders, ready to crunch them. It's not always like that. Consider scaling: a(x) = b(s*x+t). This can be used to resize images (handy), or to play records at a different speed the way you'd do with a tape recorder (less handy, unless you like squeaky or growly voices).

Now, if s isn't integral (say, s=0.6), you'd have to fetch data from places such as s*x+t = 1.3, 1.9, 2.5, 3.1, 3.7... Suppose you want to use linear interpolation to approximate a(1.3) as a(1)*0.7+a(2)*0.3. So now we need to multiply the vector of "low" elements - a([1,1,2...]) - by the vector of weights - [0.7,0.1,0.5...] - and add the result to the similar product a([2,2,3...])*[0.3,0.9,0.5...]. The multiplications and the additions map nicely to SIMD instruction sets; the indexing doesn't, because you have those weird jumpy indexes. So this time, the addressing can become a real bottleneck because it can prevent you from using SIMD instructions altogether and serialize your entire computation.

Well, at least we access adjacent elements. This means that most memory accesses will hit the cache. When you bump into an element that isn't cached yet, the machine will bring a whole cache line (say, 32 bytes), and then you'll read the other elements in that cache line, so it will pay off. You can even issue cache prefetching instructions so that while you're working on the current cache line, the machine will read the next one in the background. That way, you'll hit the cache all the time, instead of having your processor repeatedly surprised (hey, I don't have a(32) in the cache!.. hey, I don't have a(64) in the cache!.. hey, I don't have…). Avoiding the regularly scheduled surprise can be really beneficial, although cache prefetching is truly disgusting (it's basically a very finicky kind of cooperative multi-tasking - you ought to stuff the prefetching commands into the exactly right spots in your code).

Now, consider a(x) = b(f(x)) - a generic transformation of an input vector given a function for computing the input coordinate from the output coordinate. We have no idea what the next address is going to be, do we? If the transformation is complicated enough, we're going to miss the cache a lot. By the way, if the transformation is in fact simple, and the compiler knows the transformation at compile time, the compiler is still very unlikely to generate optimal cache prefetching commands. Which is one of the gazillion differences between C++ templates and "machine-optimal" code.

DVMs and TVMs

My bandwidth and addressing heuristics don't model a real machine; they only model an upgrade to the AVM for SIMD machines. Multi-box computing is one example of an entire universe of considerations they fail to model. So what we got is a DVM - Domain-specific Virtual Machine.

Now, in order to estimate performance without measuring (which is necessary when you choose your optimizations - you just can't try all the different options), I recommend a TVM (Target-specific Virtual Machine). You get one as follows. You start with the AVM. This gives overly optimistic performance estimations. You then add the features needed to get a DVM. This gives overly pessimistic estimations.

Then, you ask some low-level-loving person: "What are the coolest features of this machine that other machines don't have?" This will give you the capabilities that the real processor has but its DVM doesn't have. For example, PowerPC with AltiVec extensions is basically a standard SIMD DVM plus vec_perm. I won't talk about vec_perm very much, but if you ever need to optimize for AltiVec, this is the one instruction you want to remember. It solves the indexing problem in the scaling example above, among other things. Using a SIMD DVM and forgetting about vec_perm would make AltiVec look worse than it really is, and some algorithms much more costly than they really are.

And this is how you get a TVM for your platform. The resulting mental model gives you a fairly realistic picture, second only to reading the entire manual and understanding the interactions of all the features (not that easy). And it definitely beats the AVM by… how do you estimate the quality of handwaving? OK, it beats the AVM by the factor of 5, on average. What, you want a proof? Just watch the hands go.