"Value", the irksome euphemism

An economic value is the worth of a good or service as determined by the market.

Wikipedia

You keep using that word. I do not think it means what you think it means.

Inigo Montoya

If people pay you $1, then the economic value of your good or service has been determined by the market to be $1.

"Creating value" is thus a euphemism for "getting people to pay you money" - which has nothing to do with the usual meaning of "value".

Why is "value" an irksome euphemism? Because heroin dealers "create value", as determined by the market.

In the context of my own profession, all of the following are examples of value creation:

  • An office suite using undocumented and constantly changing formats. Value to users having no other way to access their own and others' documents: $100-$500, depending.
  • A distribution channel allowing developers to deploy software on popular devices, for which no legal alternative exists. Value to developers: 30% of revenue.
  • A social network with a billion private and corporate users who signed up for free, with a new charge to reach a given share of one's audience. Value created: $200 per post.

The more money I've extracted from you, the more value I've created, haven't I?

I'm not picking on Microsoft, Apple or Facebook. I can imagine working for any of them. My conscience is as flexible as the next guy's.

(A particularly inflexible conscience is a horrible condition. Feet to which no mass-produced shoes fit are merely inconvenient. A conscience incompatible with mass-produced social arrangements is a huge burden - not just on its owner, but on his friends and family.)

All I'm saying is that goods and services are distinct from bads and disservices, though both "create value".

Moreover, some sort of disservice tends to be essential to "value creation", a.k.a the extraction of money. People are attached to their money, and will only part with it when given little choice. Microsoft, Apple and Facebook constantly hone their methods of limiting users' choice. Who doesn't?

Business is what it is. It's not that consumers (us) are any better than producers (us). Nor is it impossible for something "free" - as in speech, beer, rider, whatever - to be a disservice to its users.

I just don't think "value" is the right word.

Will OpenCL help displace GPGPU? Parallella, P2012, …

OpenCL is usually perceived as a C dialect for GPGPU programming - doing "general-purpose" computations (not necessarily graphics) on GPU hardware. "It's like Nvidia's CUDA C, but portable".

However, OpenCL the language is not really tied to the GPU architecture. That is, hardware could run OpenCL programs and have an architecture very different from a GPU, resulting in a very different performance profile.

OpenCL is possibly the first programming language promising portability across accelerators: "OpenCL is for accelerators what C is for CPUs". Portability is disruptive. When hardware vendor A displaces vendor B, portable software usually helps a great deal.

Will OpenCL - "the GPGPU language" - eventually help displace GPGPU, by facilitating "GP-something-else" - "general-purpose" accelerators which aren't like GPUs?

We'll discuss this question on general grounds, and consider two specific examples of recent OpenCL accelerators: Adapteva's Parallella and ST's P2012.

Why displace GPGPU?

First of all, whether GPGPU is likely to be displaced or not - what could "GP-something-else" possibly give us that GPGPU doesn't?

There are two directions from which benefits could come - you could call them two opposite directions:

  1. Let software (ab)use more types of special-purpose accelerators. GPGPU lets you utilize (abuse?) your GPU for general-purpose stuff. It could be nice to have "GPDSP" to utilize the DSPs in your phone, "GPISP" to utilize the ISP, "GPCVP" to utilize computer vision accelerators likely to emerge in the future, etc. From GPGPU to GP-everything.
  2. Give software accelerators which are more general-purpose to begin with. GPGPU means doing your general-purpose stuff under the constraints imposed by the GPU architecture. An OpenCL accelerator lifting some of these constraints could be very welcome.

Could OpenCL help us get benefits from any of the directions (1) and (2)?

(1) is about making use of anal-retentive, efficiency-obsessed, weird, incompatible hardware. It's rather hard, for OpenCL or for any other portable, reasonably "pretty" language.

OpenCL does provide constructs more or less directly mapping to some of the "ugly" features common to many accelerators, for example:

  • Explicitly addressed local memory (as opposed to cache)
  • DMA (bulk memory transfers)
  • Short vector data types to make use of SIMD opcodes
  • Light-weight threads and barriers
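
For concreteness, here's a minimal OpenCL C sketch of my own - not taken from any particular vendor's code - touching each of the features above: an explicitly addressed __local buffer, an async copy standing in for DMA, a work-group barrier, and a float4 vector type:

__kernel void scale(__global const float4* in,
                    __global float4* out,
                    __local float4* tile,   // explicitly addressed local memory
                    float factor,
                    int tile_size)
{
  int group_base = get_group_id(0) * tile_size;

  // Bulk transfer from global to local memory - the closest OpenCL
  // gets to explicitly programming a DMA engine.
  event_t e = async_work_group_copy(tile, in + group_base, tile_size, 0);
  wait_group_events(1, &e);

  // Light-weight work-items synchronized with a barrier.
  barrier(CLK_LOCAL_MEM_FENCE);

  int i = get_local_id(0);
  if (i < tile_size)
    out[group_base + i] = tile[i] * factor;  // short-vector (SIMD-friendly) math
}

(The __local buffer's size is supplied by the host - via clSetKernelArg with a NULL pointer - which is one of the ways OpenCL exposes scratchpad memory explicitly.)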

But even with GPUs, OpenCL can't target all of the GPU's resources. There's the subset of the GPU accessible to GPGPU programs - and then there are the more idiosyncratic and less flexible parts used for actual graphics processing.

With accelerators such as DSPs and ISPs, my guess is that today most of their value - their acceleration ability - lies in idiosyncratic features that can't be made accessible to OpenCL programs. They could evolve, but that's a bit far-fetched and we won't dwell on it now. As things stand, OpenCL is too portable and too "pretty" to map onto most accelerators.

What about direction (2)? (2) is about making something that's more efficient than CPUs, but as nice and flexible as possible, and more flexible than GPUs.

As a whole, (2) isn't easy, for various reasons we'll discuss. But if we look, in isolation, at OpenCL the language, then it looks like a great language for targeting "faster-than-CPUs-and-more-flexible-than-GPUs" kind of accelerator.

What could such an accelerator give us that GPUs don't?

One important feature is divergent flow. GPUs are SIMD or SIMT hardware; either way, they can't efficiently support something like this:

if(cond(i)) {
  out[i] = f(i);
}
else {
  out[i] = g(i);
}

What they'll end up doing is, essentially, compute f(i) and g(i) for all values of i, and then throw away some of the results. For deeply nested conditionals, the cost of wasted computations can make the entire exercise of porting to a GPU pointless.
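
To make that concrete, here's a plain C sketch of roughly what the hardware ends up doing; cond, f and g are stand-ins for the hypothetical functions in the snippet above:

// Stand-in definitions, just so the sketch compiles; the real cond/f/g
// could be anything.
static int cond(int i) { return i & 1; }
static int f(int i)    { return i * 2; }
static int g(int i)    { return i + 1; }

void both_sides(int* out, int n) {
  for (int i = 0; i < n; i++) {
    int f_res = f(i);                  // computed for every i...
    int g_res = g(i);                  // ...and so is the other branch
    out[i] = cond(i) ? f_res : g_res;  // one of the two results is discarded
  }
}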

We'll now have a look at two OpenCL-compatible accelerators which promise to efficiently support divergent threads - or outright independent threads doing something completely unrelated. We'll briefly compare them, and then discuss some of their common benefits as well as common obstacles to their adoption.

Adapteva's Parallella

Actually, the chip's name is Epiphany - Parallella is the recently publicized name of Adapteva's planned platform based on Epiphany; anyway.

Adapteva's architecture is a 2D grid of processors with a mesh interconnect. To scale, you can have a chip with more cores - or you can have a 2D grid of chips with some of the inter-core communication seamlessly crossing chip boundaries. Each of the (scalar) processors executes its own instruction stream - no "marching in lockstep", fittingly for divergent flow.

There are no caches; a memory address can map to your local memory, or the local memory of some other processor in the grid, or to external memory. Access latency will vary accordingly; access to local memories of close neighbors is quicker than access to far neighbors. All memory access can be done using either load/store instructions or DMA.

(Note that you can reach far neighbors - unlike some more "fundamentalist" proposals for "2D scalability" where you can only talk to immediate neighbors, period. I think that's over the top; if you want to run something other than the game of life, it's awfully handy to have long communication paths - as do most computers ranging from FPGAs to neurons, some of which have really long axons.)

Stats:

  • 32K memory per core (unified - both instructions and data)
  • 4 banks that can help avoid contentions between loads/stores, instruction fetching and DMA traffic
  • 2-issue cores (one floating point operation and one integer or load/store operation)
  • 800 MHz at 28nm, using a low-power process as opposed to a high-speed one (my bet is that it's hard to top 800 MHz at 28nm LP - any evidence to the contrary?)
  • ~25mW per core, 2W peak power for a full chip with 64 cores
  • 0.128 mm^2 per core

Sources: BDTI's overview and various documentation from adapteva.com.

ST's Platform 2012

The P2012 architecture is also, at the top level, a grid of processors with a mesh interconnect. One stated motivation is the intra-die variability in future process nodes: some cores will come out slower than others, and some will be unusable.

It is thus claimed that a non-uniform architecture (like the ones we have today - CPUs and a boatload of different accelerators) will become a bad idea. If a core happens to come out badly, and it's not like your other cores, you have to throw away the entire chip. Whereas if cores are all alike, you leave the bad ones unused, and you may still have enough good ones to use the chip.

Interestingly, despite this stated motivation, the P2012 is less uniform and coarser-grained than Epiphany. Firstly, there's a provision for special-purpose accelerators in the grid. Secondly, the top-level mesh connects not individual cores, but clusters of 16 rather tightly-coupled cores (each with its own flow of control, however - again, good for divergence).

Similarly to Epiphany, data is kept in explicitly addressed local memory rather than cache, and you can access data outside the cluster using load/store instructions or DMA, but you'll pay a price depending on the distance.

However, within a cluster, data access is uniform: the 16 cores share 256K of local data memory. This can be convenient for large working sets. Instructions are private to a core - but they're kept in a cache, not a local memory, conveniently for large programs.

Stats:

  • 32K per core (16 cores with a 16K I-cache per core, and 32 banks of 8K data memory shared by the cluster)
  • 1MB L2 cache in a 4-cluster, "69-core" chip (presumably, (16+1)x4+1 - one extra core per cluster and per chip)
  • 2-issue cores (I failed to find which instructions can be issued in parallel)
  • 600 MHz at 28nm (process details unclear)
  • 2W for the 69-core chip
  • 0.217 mm^2 per core (3.7 mm^2 per (16+1)-core cluster), not accounting for L2 cache

Source: slides, slides, slides.

Parallella vs P2012: a quick comparison

Each of the architectures can have many different implementations and configurations. It seems fair to compare a 28nm 64-core Epiphany chip with a 28nm 69-core P2012 chip (or at least fair as far as these things go). Each system has its own incompatible native programming interface, but both can also be programmed in OpenCL.

Here's how Epiphany compares to P2012:

  • Power: 1x (2W)
  • Core issue width: 1x (2-issue)
  • Local memory size: 1x (32K per core)
  • Frequency: 1.33x (800/600)
  • Core area efficiency: 1.7x (0.217/0.128)

I think it's a fine achievement for Adapteva - a 5-person company (ST has about 50,000 employees - of course not all of them work on the P2012, but still; Chuck Moore's GreenArrays is 18 people - and he's considered the ultimate minimalist, and develops a much more minimalistic product which, for instance, certainly won't run OpenCL programs).

This is not to say that these numbers are sufficient to compare the architectures. For starters, we assume that the power is the same, but we can't know without benchmarking. Energy consumption varies widely across programs - low power process brings leakage down to about zero at room temperature, so you're left with switching power which depends on what code you run, and on what data (multiplying zeros costs almost nothing compared to multiplying noise, for instance).

Then there are programming model differences, ranging from the extent of compliance of floating point to the IEEE standard to the rather different memory models. In the memory department, the ability of P2012 cores to access larger working sets should somewhat negate Epiphany's raw performance advantage on some workloads (though Epiphany cores might have lower latency when accessing their own banks). But then two different 2-issue cores will generally perform differently - you need thorough benchmarking to compare.

So what are these numbers good for? Just for a very rough, ballpark estimation of the cost of this type of core. That is, a core which is flexible enough to run its own instruction stream - but "low-end" enough to burden the programmer with local memory management, and lacking much of the other amenities of full-blown CPUs (speculative prefetching, out-of-order execution, etc.)

Our two examples both point to the same order of magnitude of performance. Let's look at a third system, KALRAY's MPPA - looking more like P2012 than Epiphany, with 16-core clusters and cores sharing memory.

At 28nm, 256 cores are reported to typically consume 5W at 400 MHz (Adapteva and ST claim to give worst-case numbers). That's about 20mW per core compared to Epiphany's 25mW - but Epiphany runs at 2x the frequency. Normalizing for frequency, Epiphany comes out 1.6x more power-efficient (25mW/800MHz ≈ 0.031 mW per MHz vs MPPA's 20mW/400MHz = 0.05 mW per MHz) - and that's comparing its worst-case power to MPPA's typical power.

MPPA doesn't support OpenCL at the moment, and I found few details about the architecture; our quick glance is only good to show that these "low-end multicore" machines have the same order of magnitude of efficiency.

So will OpenCL displace GPGPU?

The accelerators of the kind we discussed above - and they're accelerators, not CPUs, because they're horrible at running large programs as opposed to hand-optimized kernels - these accelerators have some nice properties:

  • You get scalar threads which can diverge freely and efficiently - this is a lot of extra flexibility compared to SIMT or SIMD GPUs.
  • For GPGPU workloads that don't need divergence, these accelerators probably aren't much worse than GPUs. You lose some power efficiency because of reading the same instructions from many program memories, but it should be way less than a 2x loss, I'd guess.
  • And there's a programming model ready for these accelerators - OpenCL. They can be programmed in other C dialects, but OpenCL is a widespread, standard one that can be used, and it lets you use features like explicitly managed local memory and DMA in a portable way.

From a programmer's perspective - bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs - and running existing GPGPU programs almost as well as GPUs?

There are several roadblocks, however. First of all, there's no killer app for this type of thing - by definition. That is, for any killer app, almost certainly a much more efficient accelerator can be built for that domain. Generic OpenCL accelerators are good at accelerating the widest range of things, but they don't excel at accelerating anything.

There is, of course, at least one thriving platform which is, according to the common perception, "good at everything but excels at nothing" - FPGA. (I think it's more complicated than that but I'll leave it for another time.)

FPGAs are great for small to medium scale product delivery. The volume is too small to afford your own chip - but there may be too many things to accelerate which are too different from what an existing chip is good at accelerating. Flexible OpenCL accelerator chips could rival FPGAs here.

What about integrating these accelerators into high-volume chips such as application processors so they could compete with GPUs? Without a killer app, there's a real estate problem. At 100-150 mm^2, today's application processors are already rather large. And the new OpenCL accelerators aren't exactly small - they're bigger than any domain-specific accelerator.

Few chips are likely to include a large accelerator "just in case", without a killer app. Area is considered to be getting increasingly cheap. But we're far from the point where it's "virtually free", and the trend might not continue forever: GlobalFoundries' 14nm is a "low-shrink" node. Today, area is not free.

Of course, a new OpenCL accelerator does give some immediate benefit, so it isn't a purely speculative investment: you could speed up existing OpenCL applications. But for existing code which is careful to avoid divergence, the accelerator would be somewhat less efficient than a GPU, and it wouldn't do graphics nearly as well as the GPU - so it'd be a rather speculative addition indeed.

What would make one add hardware for speculative reasons? A long life cycle. If you believe that your chip will have to accelerate important stuff many years after it's designed, then you'll doubt your ability to predict exactly what this stuff is going to be, and you'll want the most general-purpose accelerator.

Conversely, if you make new chips all the time, quickly sell a load of them, and then move on to market your next design, then you're less inclined to speculate. Anything that doesn't result in a visibly better product today is not worth the cost.

So generic OpenCL accelerators have a better shot at domains with long life cycles, which seem to be a minority. And even when you find a vendor with a focus on the long term, you still have the problem of performance portability.

Let's say platform vendor A does add the new accelerator to their chip. Awesome - except you probably also want to support vendor B's chips, which don't have such accelerators. And so efficient divergence is of no use to you, because it's not portable - unless vendor A accounts for a very large share of the market, or it's a dedicated device, you write a dedicated program, and you don't care about portability.

OpenCL programs are portable, but their performance is not portable. For instance, if you use vector data types and the target platform doesn't have SIMD, the code will be compiled to scalar instructions, and so on.
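
A hypothetical illustration - mine, not from any vendor's documentation: the kernel below is portable, but the performance you wrote it for isn't:

__kernel void scale4(__global float4* buf, float factor) {
  int i = get_global_id(0);
  // On hardware with 4-wide SIMD units, this is one wide multiply per
  // work-item; on hardware without SIMD, the compiler lowers it to four
  // scalar multiplies - the code still runs, but the 4x you counted on is gone.
  buf[i] = buf[i] * factor;
}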

What this means in practice is that one or several OpenCL subsets will emerge, containing features that people count on to be supported well. For instance, a relatively good scenario is, there's a subset that GPU programmers use on all GPUs. A worse scenario is, there's the desktop GPU subset and the mobile GPU subset. A still worse scenario is, there's the NVIDIA subset, the AMD subset, the Imagination subset, etc.

It's an evolving type of thing that's never codified anywhere but has more power than the actual standard.

Standards tend to materialize partially. For example, the C standard permits garbage collection, but real C implementations usually don't collect garbage, and many real C programs will not run correctly on a standard-compliant, garbage-collecting implementation. Someone who knows C would predict this outcome, and would not trust the standard to change it.

So with efficient divergence, the question is whether this feature will make it into a widely used OpenCL subset, even though it's not part of any such subset today. If it doesn't, widespread hardware is unlikely to support it.

Personally, I like accelerators with efficient divergence. I'll say it again:

"From a programmer's perspective - bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs - and running existing GPGPU programs almost as well as GPUs?"

From an evolutionary standpoint though, it's quite the uphill battle. The CPU + GPU combination is considered "good enough" very widely. It's not impossible to grow in the shadow of a "good enough", established competitor. x86 was good enough and ARM got big, gcc was good enough and LLVM got big, etc.

It's just hard, especially if you can't replace the competitor anywhere and you aren't a must-have. A CPU is a must-have and ARM replaces x86 where it wins. A compiler is a must-have and LLVM replaces gcc where it wins. An OpenCL accelerator with efficient divergence - or any other kind, really - is not a must-have and it will replace neither the CPU nor the GPU. So it's quite a challenge to convince someone to spend on it.

Conclusion

I doubt that general-purpose OpenCL accelerators will displace GPGPU, even though it could be a nice outcome. The accelerators probably will find their uses though. The following properties seem favorable to them (all or a subset may be present in a given domain):

  • Small to medium scale, similarly to FPGAs
  • Long life cycles encouraging speculative investment in hardware
  • Device-specific software that can live without performance portability

In other words, there can be "life between the CPU and the GPU", though not necessarily in the highest volume devices.

Good luck to Adapteva with their Kickstarter project - a computer with 64 independent OpenCL cores for $99.

Do you really want to be making this much money when you're 50?

"Do You Really Want to be Doing This When You're 50?"

Well, I didn't really want to be doing this when I was 20. I'm in it for the money. As long as there's money in programming, I'll stay for the money, in all likelihood.

What else do you want to be doing when you're 50? Give me a profession remotely close to programming in the following ways:

  • Little or no required education
  • Good compensation, even for mediocre performers
  • Millions of jobs
  • No physical effort
  • No health or legal risks

Programming is money for nothing. Programming is very easy to enter and extremely hard to quit. What would you do instead?

I work with three former lawyers - two became programmers and one became a PM. I haven't met programmers who became lawyers. I do know an engineer - not a programmer - who became a patent attorney (reported reason: "at some point, you resent your manager being the age of your kids"). Would you like to become a patent attorney when you're 50?

I had a manager who decided he'd rather be a school teacher, thinking that this line of work is more beneficial to society. He quit after 8 months, saying in his parting interview to a mainstream newspaper: "Sometimes I just want to enter the classroom with a machine gun and open fire". He's with Samsung now; he feels that his contribution to smartphone imagers benefits society substantially enough.

One of my roommates at work has been studying a bunch of things for a while now. He's got a degree in psychology and in something called Visual Theater. He's been programming part-time all the while, which is how he financed his studies. He's programming as a part of his visual performances (there's computer music involved). He'll likely be programming to finance his art work. I'm not sure he plans to quit programming at any defined point.

I've seen a lot of people "quitting" to study anything from physics to philosophy, and then going back to programming. The money is addictive. There are many other sources of satisfaction, of course - which is why I run this blog for free - but much of this satisfaction has to do with demand, directly or indirectly, and is thus very much related to money. "Building something useful" and "making money" are close relatives.

You could, of course, become independently wealthy. But you probably won't, and then programming is your plan B. There's also a thing about material wealth - it's easily taken away. I'm from Soviet Russia, so I tend to exaggerate the likelihood of that - but really, property is easily confiscated, and paper money can become paper overnight. It's not just a USSR thing; the US confiscated gold from its citizens at about the same time as the USSR. Professional ability, however, can't be confiscated. The prudent (paranoid?) independently wealthy programmer will thus make some effort to stay in a good shape.

There's the argument that professional programming is stressful. Again - compared to what? A doctor's work? A lawyer's work? Answering calls by irate customers while your responses are recorded for later inspection?

What stress? Programmers who can program at all - as in, print out a binary tree correctly - are very scarce. This scarcity makes it rather hard to push programmers around. You can try to bully them into doing unpaid overtime, but they quickly learn that it's a seller's market, and that you're basically bluffing. You have nobody to replace them with.

With demand outstripping supply, there's enough space in programming for everyone. This makes for a not-so-competitive environment, compared to, say, finance/investment banking type of jobs. Programmers are also typically shielded from customers and senior management - the kind of people who're always right, a trait making communication somewhat tiresome.

Deadlines? Sure, we have them, just like everybody else. Let's admit it though - we tend to miss them, and it's not very stressful to us unless we want it to be. If you're given an impossible schedule, and you do your best, and you miss the deadline, you can suffer deeply or you can maintain mental peace. The fact is that your material well-being is rarely in jeopardy because of a missed deadline, so your reaction is fully up to you.

There's the argument that programmers can't fully understand what's going on, what with all the APIs and layers and stuff. And if you don't understand your own environment, that's stressful and that's not fun. Fair enough; but again - who does understand his environment more than a programmer? A doctor digging into a patient's guts? A lawyer sifting through legal documents? An investor trading financial derivatives? A manager overseeing 10 or 20 programmers? With all the self-inflicted complexity, we're still in a better shape than most.

The fact is that there are relatively few programmers in their fifties around. Does it mean people don't survive in programming though? More likely, it is simply a result of growth. There were few 20 year old programmers 30 years ago - compared to 10 years ago. Therefore, there are fewer 50 year old programmers today than 30 year old programmers. To the extent that the growth in programming slows down, things will be different 20 years down the road.

So I'm not planning to quit programming, not because it's such a great source of joy by itself, but because it looks so good compared to just about anything else. Maybe not the most "passionate" statement - but passion burns out, whereas greed is sustainable. And if you plan to quit programming, I wonder what your alternative is, and I won't be surprised if you come back to programming in a few years.

C as an intermediate language

Here's a Forth program debugged in KDevelop - a graphical debugger without Forth support:

[Screenshot: KDevelop used to debug Forth code]

Cool stuff, not? The syntax highlighting for Forth files - that's someone else's work that comes with the standard KDevelop installation. But the rest - being able to run Forth under KDevelop, place breakpoints, and look at program state - all that stuff is something we'll develop below.

I'll show all the required code; we don't have to do very much, because we get a lot for free by using C as an intermediate language.

A high-level intermediate language is not unusual. A lot of compilers target an existing high-level platform instead of generating native code - for instance, by generating JVM bytecode or JavaScript source code. Why? Because of all the things you get for free that way:

  • Portability
  • Optimization
  • Some degree of library interoperability
  • Some degree of tools interoperability (IDEs, debuggers, etc.)

A few languages targeting high-level platforms are rather well-known: Scala and Clojure with compilers targeting the JVM, CoffeeScript and Dart which are compiled to JavaScript. (Then there's Java, which Google famously compiles to JavaScript - though that remains somewhat offbeat.)

Which languages of this kind are the most successful? Without doubt, today the answer is C++ and Objective-C - languages whose first compilers emitted C code.

I think C is an awesome intermediate language for a compiler to emit. It's extremely portable, it compiles in a snap, optimizes nicely, and you get interoperability with loads of stuff.

When I wanted to make a compiler for an interpreted language we developed internally, I actually thought about targeting a VM, not a source language. I planned to emit LLVM IR. It was GD who talked me out of it; and really, why LLVM IR?

After all, LLVM IR is less readable than C, less stable than the C standard - and less portable than C. It will likely always be, almost by definition.

Even if LLVM runs on every hardware platform, LLVM IR will only be supported by the LLVM tools - but not, say, the GNU tools, or Visual Studio. Whereas generating C code gives you great support by LLVM tools - and GNU, and Visual Studio. Debugging a program generated from LLVM IR in Visual Studio will probably always be inferior to debugging auto-generated C code compiled by the Visual Studio compiler.

"C as an intermediate language" is one of those things I wanted to write about for years. What prevented me was, I'd like to walk through an example - including some of the extra work that may be required for better debugging support. But I couldn't think of a blog-scale example ("web-scale" seems to be the new synonym for "loads of data"; I propose "blog-scale" to mean "small enough to fully fit into a blog post".)

Then it dawned on me: Forth! The minimalist language I've fallen out of love with that still has a warm place in my heart.

So, I'll do a Forth-to-C compiler. That'll fit in a blog post - at least a small Forth subset will - and Forth is different enough from C to be interesting. Because my point is, it doesn't have to be C with extensions like C++ or Objective-C. It can be something rather alien to C and you'll still get a lot of mileage out of the C tools supplied by your platform.

Without further ado, let's implement a toy Forth-to-C compiler: pick a small Forth subset, choose a "compilation strategy", write the compiler and a tiny runtime, test the result, and then debug and profile it with the standard C tools.


Enough Forth to not be dangerous

To be dangerous, we'd have to support CREATE/DOES> or COMPILE or POSTPONE or something of the sort. We won't - we'll only support enough Forth to implement Euclid's GCD.

So here's our Forth subset - you can skip this if you know Forth:

  • Forth has a data stack.
  • Integers are pushed onto the stack. When you say 2 3, 2 is pushed and then 3.
  • Arithmetic operators pop operands from the stack and push the result. 6 2 / pops 6 and 2 and pushes 6/2=3. 2 3 = pushes 0 (false), because 2 is not equal to 3.
  • Stack manipulation words, well, manipulate the stack. DUP duplicates the top of the stack: 2 DUP is the same as 2 2. Swap: 2 3 SWAP is the same as 3 2. Tuck: 2 3 TUCK is the same as… errm… 3 2 3. As you can already imagine, code is more readable with fewer of these words.
  • New words are defined with : MYNAME …code… ; Then if you say MYNAME, you'll execute the code, and return to the point of call when you reach the semicolon. No "function arguments" are declared - rather, code pops arguments from the stack, and pushes results. Say, : SQUARE DUP * ; defines a squaring word; now 3 SQUARE is the same as 3 DUP * - it pops 3 and pushes 9.
  • Loops: BEGIN … cond UNTIL is like do { … } while(!cond), with cond popped by the UNTIL. BEGIN … cond WHILE … REPEAT is like while(1) { … if(!cond) break; … }, with cond popped by the WHILE.
  • Printing: 2 . prints "2 ". CR prints a newline.
  • Forth is case-insensitive.

That's all - enough to implement GCD, and to scare and startle your infix-accustomed friends.


"Compilation strategy"

…A bit too dumb to be called a strategy; anyway, our blog-scale, blog-strength compiler will work as follows:

  • The stack is an array of "data"; data is a typedef for long.
  • Each user-defined word is compiled to a C function getting a data* pointing to the top of stack (TOS), and returning the new TOS pointer.
  • Where possible, built-in words can be used similarly to user-defined words: s=WORD(s).
  • Some words can't work that way: + can't become s=+(s) because that's not valid C. For such words, we do some trivial translation. 2 becomes PUSH(2), + becomes OP2(+). Similarly, control flow words are translated to do/while, if and break.
  • The stack grows downwards and shrinks upwards: if s[0] is the TOS, s[1] is the element below it, etc; s++ shrinks the stack by popping the TOS.

That's all; for instance, the gcd example from here:

: gcd begin dup while tuck mod repeat drop ;

…compiles to the C code below. As you can see, every word is translated in isolation - the compiler has almost no comprehension of "context" or "grammar":

data* gcd(data* s) { //: gcd
  do {               //begin
    s=dup(s);        //dup
    if(!*s++) break; //while
    s=tuck(s);       //tuck
    OP2(%);          //mod
  } while(1);        //repeat
  s=drop(s);         //drop
  return s;          //;
}

Here's the full Python source of the compiler (forth2c.py) - a bit redundant after all the explanations:

#!/usr/bin/python
import sys
# "special" words - can't emit s=WORD(s)
special = {
  ':':'data* ',
  ';':'return s; }',
  '.':'PRINT();',
  'begin':'do {',
  'until':'} while(!*s++);',
  'repeat':'} while(1);',
  'while':'if(!*s++) break;',
}
# binary operators
binops='+ - * / = < > mod'.split()
op2c={'=':'==','mod':'%'}

def forth2c(inf,out):
  n = 0
  defining = False # True right after a ':' - the next token names the new word
  for line in inf:
    n += 1
    # emit line info for C tools (debuggers, etc.)
    # - a nice option C gives us
    print >> out,'\n#line',n,'"%s"'%infile
    for token in line.lower().strip().split():
      if token in special:
        print >> out,special[token],
      else:
        try:
          num = int(token)
          print >> out, 'PUSH(%d);'%num,
        except ValueError:
          if token in binops:
            print >> out,'OP2(%s);'%op2c.get(token,token),
          else:
            if defining:
              print >> out,token+'(data* s) {',
            else: # call
              print >> out,'s=%s(s);'%token,
      defining = token == ':'

out = open('forth_program.c','w')
print >> out, '#include "forth_runtime.h"'
for infile in sys.argv[1:]:
  forth2c(open(infile),out)

And here's the "runtime library", if you can call it that (forth_runtime.h). It's the kind of stack-fiddling you'd expect: s++, s--, s[0]=something.

#include <stdio.h>
typedef long data;

#define PUSH(item) (--s, *s = (item))
#define OP2(op) (s[1] = s[1] op s[0], ++s)
#define PRINT() (printf("%ld ", s[0]), ++s)
#define cr(s) (printf("\n"), s)
#define drop(s) (s+1)
#define dup(s) (--s, s[0]=s[1], s)
#define tuck(s) (--s, s[0]=s[1], s[1]=s[2], s[2]=s[0], s)

#define MAX_DEPTH 4096 //arbitrary
data stack[MAX_DEPTH];
data* forth_main(data*);
int main() { forth_main(stack+MAX_DEPTH); return 0; }

Note that our Forth dialect has an entry point word - forth_main - that is passed a pointer to an empty stack by C's main. Real Forth doesn't have an entry point - it's more like a scripting language that way; but the forth_main hack will do for our example.

That's all! A compiler for a Forth dialect in under 60 LOC - by far the shortest compiler I ever wrote, if not the most useful.

A test

Let's test our forth2c using a few example source files. countdown.4th (from here):

: COUNTDOWN
  BEGIN
    DUP .
    1 -
    DUP 0 =
  UNTIL
  DROP
;

gcd.4th (a multi-line version - to make single-stepping easier):

: gcd
  begin
    dup
  while
    tuck
    mod
  repeat
  drop
;

main.4th:

: forth_main
  5 countdown
  cr
  10 6 gcd .
  35 75 gcd .
  12856 3248 gcd .
  cr
;

We can run the program with the shell script:

./forth2c.py countdown.4th gcd.4th main.4th
gcc -g -static -Wall -o forth_program forth_program.c
./forth_program

This should print:

5 4 3 2 1
2 5 8

So countdown manages to count down from 5 to 1, and gcd finds the gcd. Good for them.


Debugging

Let's try our program with the trusty gdb:

$ gdb forth_program
There is NO WARRANTY!! etc. etc.
(gdb) b countdown # place breakpoint
Breakpoint 1: file countdown.4th, line 3.
(gdb) r # run
Breakpoint 1, countdown (s=0x804e03c) at countdown.4th:3
(gdb) bt # backtrace
#0  countdown (s=0x804e03c) at countdown.4th:3
#1  forth_main (s=0x804e03c) at main.4th:2
#2  main () at forth_runtime.h:14
(gdb) l # list source code
1	: COUNTDOWN
2	  BEGIN
3	    DUP .
4	    1 -
5	    DUP 0 =
6	  UNTIL
7	  DROP
8	;
(gdb)

Yay!!

Seriously, yay. We placed a breakpoint. We got a call stack - forth_main called countdown.  And we've seen the Forth source code of the function we debug. All that in a debugger that has no idea about Forth.

Partly it's due to compiling to high-level primitives, such as functions, that are supported by tools - we'd get that in every language. And partly it's due to #line - something we don't get everywhere.

#line is brilliant - all languages should have it. That's how we tell debuggers where the compiled code is really coming from - not the intermediate C code, but the real source.
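
For instance, here's roughly what forth2c emits for countdown.4th (whitespace tidied up a bit) - each chunk of generated C is preceded by a #line directive pointing back at the Forth source line it came from:

#line 1 "countdown.4th"
data* countdown(data* s) {
#line 2 "countdown.4th"
do {
#line 3 "countdown.4th"
s=dup(s); PRINT();
#line 4 "countdown.4th"
PUSH(1); OP2(-);
#line 5 "countdown.4th"
s=dup(s); PUSH(0); OP2(==);
#line 6 "countdown.4th"
} while(!*s++);
#line 7 "countdown.4th"
s=drop(s);
#line 8 "countdown.4th"
return s; }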


Profilers and other tools

It's not just debuggers that understand #line - it's also profilers, program checkers, etc.

Here's KCachegrind showing the profile of our Forth program, obtained using Valgrind as follows:

valgrind --tool=callgrind ./forth_program
kcachegrind `ls -t callgrind* | head -1`

[Screenshot: KCachegrind profiling Forth code]

Now isn't that spiffy?

C compilers are also aware of #line - which we could try to abuse by pushing error reporting onto the C compiler. Say, if we use an undefined word do_something, we get an error at just the right source code line:

main.4th:3: implicit declaration of function ‘do_something’

A very sensible error message - and we didn't have to do anything! Now let's try a BEGIN without an UNTIL - let's delete the UNTIL from countdown.4th:

gcd.4th:1: error: expected ‘while’ before ‘data’

Um, not a very good message. Perhaps this isn't such a good idea after all. If we want quality error reporting, we have to do the error checking at the source language level.


Custom data display: gdb pretty-printers

Things look rather nice. C tools - gdb, Valgrind, KCachegrind - are doing our bidding, aren't they?

Except for the data stack. Instead of the stack elements, gdb shows us the TOS pointer in hexadecimal: countdown (s=0x804e03c), forth_main (s=0x804e03c), which looks a bit lame.

It's not that the local variable s is useless - on the contrary, that's what we use to look at the data stack. This can be done using gdb commands such as p s[0] (prints the TOS), p s[1] (prints the element below TOS), etc. But we'd much rather look at the whole stack at a time, so that we can see how TUCK tucks our numbers into the stack as we single-step our code.

Can it be done?

  • The good news is that it's easy enough with gdb pretty-printers. (To me it became easy after Noam told me about the pretty-printers.)
  • The bad news is that, unlike the case with the universally supported #line, it's impossible to do custom pretty-printing in a uniform way for all tools. There's no "#line for pretty-printing" - quite a pity.
  • But then the good news is that gdb front-ends such as KDevelop or Eclipse support gdb pretty-printers - to an extent. (KDevelop apparently does a better job than Eclipse.) So you don't have to write a GUI plugin or something equally horrible.

Here's a gdb pretty printer for displaying our Forth data stack. It prints all the elements, from bottom to top (as Forth stack contents are usually written). The TOS pointer is the data* you pass for printing - typically, some function's local variable s. The bottom of the stack is known statically - it's the pointer past the end of the global stack[] array. (If we had threads, that array would be thread-local, but anyway, you get the idea.)

So here's gdb_forth.py:

class StackPrettyPrinter(object):
  bottom = None
  bot_expr = 'stack + sizeof(stack)/sizeof(stack[0])'
  def __init__(self,s):
    self.s = s
  def to_string(self):
    if not self.bottom:
      self.bottom = gdb.parse_and_eval(self.bot_expr)
    s = self.s
    i = 0
    words = []
    while s[i].address != self.bottom:
      words.append(str(s[i]))
      i += 1
    return ' '.join(reversed(words))

def stack_lookup_func(val):
  if str(val.type) == 'data *':
    return StackPrettyPrinter(val)

gdb.pretty_printers.append(stack_lookup_func)
print 'loaded Forth extensions'

gdb.pretty_printers is a list of functions which may decide to wrap gdb.Value they're passed in a pretty-printer class object. Typically they decide based on the value's type. gdb.parse_and_eval returns a gdb.Value representing the value of the given C expression. A gdb.Value has a type, an address, etc. If the gdb.Value is a pointer, you can index it as a Python list: val[ind], and if it's a struct, you can use it as a dict: val['member_name'] (we don't have an example of that one here).

If you're interested in gdb pretty printers - a couple of notes:

  • Getting values from other values is much faster than parse_and_eval which just grinds the CPU to a halt; so learning the not-that-Pythonic-while-also-not-that-C-like syntax of gdb.Value access is very worthwhile.
  • You can return "maps" or "arrays" instead of plain strings. Pretty-printers for STL maps and vectors work that way. This can be used to pretty-print high-level data structures such as dynamic objects (if you're lucky and your design matches gdb's view of things - I found a lot of devil in the details.)

Let's test-drive our pretty printer - but first, let's register it in our ~/.gdbinit:

python execfile('/your/path/to/gdb_forth.py')

And now:

$ gdb forth_program
There is NO WARRANTY!!
loaded Forth extensions # aha, that's our printout!
(gdb) b gcd
(gdb) r
Breakpoint 1, gcd (s=10 6) # and that's the stack
(gdb) python for i in range(4): gdb.execute('c')
# continue 4 times
Breakpoint 1, gcd (s=6 4)
Breakpoint 1, gcd (s=4 2)
Breakpoint 1, gcd (s=2 0)
# these were the steps of gcd(10,6)
Breakpoint 1, gcd (s=35 75)
3           dup
# new inputs - gcd(35,75). let's single-step:
(gdb) s # step (execute DUP)
4         while
(gdb) p s # print s
$1 = 35 75 75 # so DUP duplicated the 75.
(gdb) s # step (execute WHILE)
5           tuck
(gdb) p s
$2 = 35 75 # …and WHILE popped that 75.
(gdb) s # and now we execute the mighty TUCK:
6           mod
(gdb) p s
$3 = 75 35 75 # YES! Tucked the 75 right in!
(gdb) bt # and how's main's data stack looking?
#0  gcd (s=75 35 75) at gcd.4th:6
#1  forth_main (s=75 35) at main.4th:5
# well, it looks shorter - somewhat sensibly.
# (main TOS pointer is unchanged until gcd returns.)

I think it's rather nice. 10 6, 6 4, 4 2, 2 0 - a concise enough walk through Euclid's algorithm in Forth.


KDevelop with the stack display

And, finally, here's how it looks in KDevelop (where the stack displayed in the GUI changes upon every step, as one would expect from a good debugger). The stack is in the variables view on the left.

There is nothing special to be done to make things work in KDevelop - except for the standard procedure of convincing it to debug a user-supplied executable, as you might do with a C program.

Features that don't map naturally to C

Some features of a source language map more naturally to C than others. If we tried to tackle a language different from Forth - or to tackle all of Forth instead of a small subset - some features would pose hard problems.

Some such features can be handled straightforwardly with another intermediate language. For example, dynamic data attributes map to JavaScript much more nicely than they do to C.

But in many cases you don't pick your intermediate language because it's the most suitable one for your source language. If you want to target browsers, then it pretty much has to be JavaScript, even if it's a poor fit for your source language. Similarly, in other situations, it has to be C - or JVM, or .NET, etc.

In fact, many language designers made it their key constraint to play nicely with their platform - the intermediate language. Some advertise the harmonious integration with the platform in the language name: C++, Objective-C, CoffeeScript, TypeScript. They know that many users are more likely to choose their language because of its platform than for any other reason, because they're locked into the platform.

With that said, here are a few features that don't map straightforwardly to C, and possible ways to handle them.

  • Dynamic binding: easy enough. Generate something like call(object,method_id), or call(object,"method_name"), multidispatch(method,obj1,obj2) - whatever you want. Should work nicely.
  • Dynamic code generation - eval, etc.: C doesn't have eval - but it does have fopen("tmp.c"), fwrite, system("gcc -o tmp.so tmp.c …"), dlopen("tmp.so"), and dlsym. This can give you something a lot like eval. You need non-portable code to make it work, but not much of it. Alternatively, if a relatively slow eval with relatively poor debugger support is OK (often it is), you can implement eval using an interpreter, which can reuse the parser of your static compiler. We use both methods in a few places and it works fine.
  • Dynamic data: Code generation is easy - access(object,"attr_name") or whatever. The ugly part is viewing these objects in debuggers. One approach is pretty-printers or similar debugger scripting. Another approach - for "static enough data" - is to generate C struct definitions at runtime, compile them to a shared library with debug info, and load the library, as you'd do to implement eval.
  • Exceptions: I use setjmp/longjmp for that; I think that's what cfront did. (A minimal sketch appears right after this list.)
  • Garbage collection: AFAIK it can work fine, funnily enough. There are garbage collectors for C, such as the Boehm collector. Do they work for C programs? Not really - they can mistake a byte pattern that happens to look like a pointer for an actual pointer, causing a leak. But you don't have that problem - because your C program is generated from code that your compiler can analyze and then tell the collector where to look for pointers. So while gc doesn't truly work for C, it can work fine for languages implemented on top of C!
  • Generics, overloading, etc.: most of this is handled by your compiler, the only problem being name mangling. You can mangle names in a way that you find least ugly, or you can mangle them according to the C++ mangling convention, so that debuggers then demangle your names back into collection<item> or some such. (The C++ mangling convention is not portable so you might need to support several.)
  • Multiple return values: we pass pointers to all return values except the first; I think returning a struct might make things look nicer in debuggers. Of course the calling convention will be less efficient compared to a compiler emitting assembly rather than C. But it will still be more efficient than your actual competition, such as a Python tuple.
  • Program reduction (as in, transform (+ (* 2 3) 5) to (+ 6 5), then to 11): ouch. I guess it could make sense to write an interpreter to do that in C, but you wouldn't get that much mileage out of the C optimizer, and no mileage at all from C debuggers, profilers, etc. The basic model of computation has to be close enough to C to gain something from the C toolchain besides the portability of your C code implementing your language. (I'm not saying that you absolutely have to create a memory representation of (+ 6 5) at runtime - just that if you want to create it, then C tools won't understand your program.)
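
As promised in the exceptions item above, here's a minimal sketch of the setjmp/longjmp approach - my illustration, not the output of any particular compiler. A real implementation would keep a stack of jmp_bufs, one per active try block (and per thread):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf try_ctx;  // where the innermost "try" saved its context
static int exc_value;    // whatever represents the thrown object

static void throw_exc(int v) {
  exc_value = v;
  longjmp(try_ctx, 1);   // unwind straight back to the matching setjmp
}

static void might_fail(int x) {
  if (x < 0)
    throw_exc(42);
}

int main(void) {
  if (setjmp(try_ctx) == 0) {   // "try"
    might_fail(-1);
    puts("not reached");
  } else {                      // "catch"
    printf("caught %d\n", exc_value);
  }
  return 0;
}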

What about using C++ instead of C?

You could, and I did. I don't recommend it. You get exceptions for free instead of fiddling with setjmp/longjmp. Then they don't work when you throw them from one shared library and catch them in another (I think RTTI has similar problems with shared libraries). You get templates for free. Then your auto-generated code compiles for ages.

Eli Bendersky writes about switching from C to C++ to implement a language VM. He says it's easier to use C++ features when they match your VM's needs than to reimplement them in C.

Here's how it worked out for me. I once implemented a VM and an interpreter in C++. Then we wrote a compiler for the language. Because the VM was in C++, it was much easier for the compiler to emit C++ code interfacing with the VM than C code. And then with this auto-generated C++ code, we bumped into shared-library-related problems, and build speed problems, etc. - and we have them to this day.

But of course, YMMV. One reason to use C++ is that debuggers get increasingly better at displaying STL collections and base class object pointers (which they know to downcast to the real runtime type and show the derived class members). If your language constructs map to these C++ features, you can get better debug-time data display in a portable way.

Anything you'd like to add?

If you have experience with using C as an intermediate language, I'll gladly add your comments. I'm more experienced with some things than others in this department, and so a lot of the potential issues would likely never occur to me.

Conclusion

C is a great intermediate language, and I'd be thrilled to see more new languages compiling to C. I don't live in either Java or JavaScript these days; surely I'm not alone. A language compiling to C could be fast, portable, have nice tools and library support - and immediately usable to many in the land of C, C++ and Objective-C.

Error codes vs exceptions: critical code vs typical code

Error codes or exceptions - which is better? Here's my answer:

  1. They have the same worst case - a human error can lead to a complete disaster.
  2. Exceptions are far safer for most code.
  3. Error codes are far safer for well-reviewed, critical code.

(As you can see from 2 and 3, I believe that most code is not critical and/or poorly reviewed; I think most people will agree on that one.)

Worst case: disaster

Here's a disaster with error codes (based on an example from a critique of Go's error handling):

seal_presidential_bunker()
trigger_doomsday_device()

If we fail to seal_presidential_bunker, we still trigger_doomsday_device, because the programmer forgot to check the error code. Human error has led to a disaster.

(The original article doesn't specify exactly what the disaster is. One problem is that the presidential staff is not safe. Another problem is that the doomsday device got triggered - which wouldn't happen if an exception were thrown and left uncaught. Which of the two problems is the bigger part of the disaster depends on your worldview.)

Here's a disaster with exceptions.

open_the_gate()
wait_for_our_men_to_come_in()
close_the_gate()

If wait_for_our_men_to_come_in throws an exception, then we'll never close_the_gate, and the enemy will sneak in. Again - human error, disaster.

So in theory, exceptions and error codes are equally bad.

Exceptions are safer for most code

Most code doesn't trigger doomsday devices, nor deals with lethal enemies at the gates. When most code messes up, garbage appears on the screen or in log files, and a programmer shows up to debug the problem.

With exceptions, it's easier for the programmer to figure out why this garbage appeared, because the failure occurs closer to the point of the error.

f=open_users_file()
print_users_list(f)

If open_users_file() throws an exception, then the programmer will see a "No such file or directory" with a call stack and think, "why couldn't this idiot [possibly, me] bother to check if the file is there?" Then he fixes the bug and all is well again.

If open_users_file() returns an invalid file object (similarly to, for example, C++'s ifstream), then print_users_list (which doesn't check errors, either) might print an empty user list. The error might then become "No such user", or "Permission denied", etc. The program will fail further from the point of error - the file opening code - and you'll need to go back and figure out where the error is.

For production code, failing early isn't necessarily better. Failing early is what leaves the gate open for the enemies in the above example. Failing early due to a floating point error - instead of trying further just in case - was reportedly the root cause of the explosion of Ariane 5, costing $.5G.

But for most code, which:

  • doesn't lead to such large damages
  • is written rather hastily
  • is expected to have bugs
  • has programmers constantly attending to it and fixing those bugs

…for most code, failing early is better simply because it always makes debugging easier - even if it doesn't make the impact of the error smaller.

Error codes have another horrible anti-debugging quality: loss of information. Even if the program fails early with error codes, you usually only get the code of the topmost layer without all the details from below.

With an exception, you get a call stack, and an error string from the bottom layer. With a perror(), you get just an error string from the bottom layer ("No such file or directory" - which file? Who wants it?). With error codes, you get something like "User list management error" - the fact that it was a file opening error gets "swallowed" by layers of code converting low-level error codes to high-level ones.
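
Here's a small C sketch of that swallowing - hypothetical functions and error codes, purely for illustration:

#include <stdio.h>

enum { OK = 0, ERR_IO = 1, ERR_USER_LIST = 2 };

static int open_users_file(FILE** f) {
  *f = fopen("users.txt", "r");
  return *f ? OK : ERR_IO;   // errno's "No such file or directory",
                             // and the file name, are already gone
}

static int load_user_list(FILE** f) {
  if (open_users_file(f) != OK)
    return ERR_USER_LIST;    // now even "it was an I/O error" is gone
  return OK;
}

int main(void) {
  FILE* f;
  if (load_user_list(&f) != OK)
    printf("User list management error\n");  // all the caller gets to report
  return 0;
}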

It's possible to collect an "error code call stack" with all the information, but it's almost never done. Whereas an exception does it automatically for the laziest of programmers. Another win for whoever gets to debug the code.

Error codes are safer for well-reviewed code

Code reviews are generally easier with error codes than exceptions. Error codes mean that you must carefully look at function calls to see if the programmer handled the possible errors. Exceptions mean that you must imagine what happens if an exception is thrown anywhere in the flow.

Making sure that the opening of every gate is exception-safe - that the gate gets closed when an exception is thrown - is hard. C++ has RAII for the gate-closing (and Python has with and C# has using), and Java has checked exceptions for the exception-hunting.

But even if you have both and then some, it still seems hard. A program has a lot of intermediate states, and some of them don't make sense. An exception can leave you in this intermediate state. And it's not easy to wrap every entrance into an intermediate state using whatever exception-safety-wrapper your language gives you.

So I think it makes sense for Go - a language for writing critical production code - to shun exceptions. We use C++ with -fno-exceptions for serious production code and I think it's equally sensible.

It just doesn't make sense to write most of your code that way. In most of my code, I want to always fail early to make debugging easier, seeing the full context of the error, and I want that to happen without putting much thought into error handling.

And this is why I think exceptions should be embraced by all lazy programmers writing low-quality code like myself.

Aren't side effects fundamental in complexity analysis?

I'm not that good at functional programming and the related discussion of the benefits and drawbacks of side effects. Please forgive me if I'm utterly wrong - or if I'm trivially right because I'm saying something elementary.

So the thing is, very basic complexity analysis that we all use assumes a von Neumann machine with a random access memory. Random access means that you can read an arbitrary array element, and that you can write it.

Suppose we throw away the ability to write array elements, which is a side effect. Instead, we replace it with the "functional" ability to create a new array having the same elements as the old one at all indexes except for one - the one we'd write if we had side effects. So we create an array like this: {old,…,old,new,old,…,old}.

In this sense, aren't we getting a fundamentally inefficient system? We took an O(1) operation and replaced it with an O(N) operation. We don't have to use the new O(N) operation every time we previously used the old O(1) one, of course - we can use an entirely different algorithm instead. But we must be losing something.

For instance, how do you compute a histogram? With side effects, you have a loop over your elements doing histogram[array[i]]++ - this is O(N), where N is the number of elements.
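
In C - a sketch assuming all the values in array fall into the k bins:

void histogram(const int* array, int n, int* hist, int k) {
  for (int b = 0; b < k; b++)
    hist[b] = 0;
  for (int i = 0; i < n; i++)
    hist[array[i]]++;   // one O(1) in-place update per element - O(N) overall
}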

Without side effects, we can create a new histogram at every step - this is O(N*K), where K is the number of histogram bins. Or we can sort the array - O(N*log(N)) - and then have an O(N) pass over the sorted array, consing every time a new element is found. Both options are asymptotically slower, not just programming-language-shootout slower. Did I miss an O(N) option?

I assume that the people most likely to point out the drawbacks of side effects are also likely to care about asymptotic complexity, because both questions are very basic CS questions. What's the standard workaround for the loss of asymptotic efficiency, then?

One answer could be a smart compiler - smart enough to implement the logical creation of an updated histogram as an update to the memory keeping the old histogram.

But it seems that if we had a compiler that smart, then side effects wouldn't be a problem in the first place. An imperative program full of side effects can be trivially implemented as a recursive function taking a state (the entire state of the memory - one large array), creating a new state, and calling itself with this state.
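
Here's a minimal C++ sketch of that emulation - the "program" below is pure in the nominal sense only:

#include <vector>

// "Imperative program as a pure function": the entire memory is a single value,
// each step builds a new state from the old one and recurses with it. Nominally
// there are no side effects; in practice each step copies the whole memory,
// unless a compiler is smart enough to update it in place.
std::vector<int> step(const std::vector<int>& memory, int pc) {
    if (pc == 100)                    // arbitrary halting condition for the sketch
        return memory;
    std::vector<int> next = memory;   // "new state" = copy of the old one...
    next[pc % next.size()] = pc;      // ...with one cell changed
    return step(next, pc + 1);
}

int main() {
    std::vector<int> memory = step(std::vector<int>(1024, 0), 0);
    return memory[42];                // 42
}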

The reason people object to side effects seems related to the fact that such a program is "functional" only in a pointlessly theoretical sense. The nominal lack of side effects gives you no actual added ability to analyze the data flow of such programs.

I haven't formally proved it by this example, of course, but it seems to me that a nominally functional operation of creating a new array just like the old one with a single modified element is not a very good one. It can sometimes be hard or impossible to elide - as in the "imperative program emulation" example - and when it can't be, the need to fall back to actually copying is just too disastrous.

I guess this is why functional languages have cons - make a new list from a new element, the head, and an old list, the tail - but don't have a similar "array consing" primitive for inserting a new element into a random position of an old array.

Now in the absence of a compiler smart enough for efficient "array consing", people seem to be using mutable data structures, such as Haskell's mutable arrays. This seems to forfeit some of the correctness guarantees gained by not having side effects. For example:

The freezeArray action converts a mutable array into an immutable one by copying, whereas unsafeFreezeArray returns an immutable array that is effectively just the type cast version of the mutable array. Should you write to the mutable array after it has been (unsafely) frozen, you'll side-effect the immutable array in the process. Please don't :-)

Bugs introduced by side effects coupled with aliasing - you modified something at the wrong time and someone else then references that something - are back.

So what's the point of shunning side effects if you're forced to have them anyway to do some of your computations asymptotically efficiently? Is the point having a clear separation between "pure" code and side-effecting code?

From a "social" more than a "formalistic" standpoint, such a separation indeed makes a lot of sense when "pure" code computes and side-effecting code does I/O. It's awesome to be able to tell pure computations from I/O, and it's actually a pity that imperative language typically allow people to litter computational code with I/O as easily as they do.

But if side effects are also necessary for computational code, then, again from a "social" standpoint, the pure/impure separation doesn't seem like a stable thing that is likely to correspond to how people think of boundaries between parts of the system. A piece of code that doesn't use histograms today might use them tomorrow, and then its purity will be gone.

Generally it seems like a whole lot of algorithms require writable RAM to be efficient. Hough transforms are "glorified histograms". A bunch of algorithms processing graphs or matrices update "rather random" indexes/counters iteratively, with no clear way to order these updates so that they become non-random enough to turn into list consing or the like.

I'm not saying pure functional languages aren't fun or productive for you - a very valid "social" argument in favor of purity. I just wonder how far purity can take you from a complexity analysis standpoint.

If it's not very far, or if it's not very clear how far, then why not have side effects "by default", and use data & control flow analysis to identify places where there are no side effects after all, or where their impact is limited enough (as in, i++ is a nominal side effect but a trivial one)?

In particular, how can purity be "the" way to safe parallelism if it robs you of asymptotic performance, and you're after parallelism to get performance?

[I'm not at all sure I get this right, so I'm asking a "real" question, not a rhetorical one.]

What "Worse is Better vs The Right Thing" is really about

I thought about this one for a couple of years, then wrote it up, and left it untouched for another couple of years.

What prompted me to publish it now - at least the first, relatively finished part - is Steve Yegge's post, an analogy between the "liberals vs conservatives" debate in politics and some dichotomies in the professional worldviews of software developers. The core of his analogy is risk aversion: conservatives are more risk averse than liberals, both in politics and in software.

I want to draw a similar type of analogy, but from a somewhat different angle. My angle is, in politics, one thing that people view rather differently is the role of markets and competition. Some view them as mostly good and others as mostly evil. This is loosely aligned with the "right" and the "left" (with the caveat that the political right and left are very overloaded terms).

So what does this have to do with software? I will try to show that the disagreement about markets is at the core of the conflict presented in the classic essay, The Rise of Worse is Better. The essay presents two opposing design styles: Worse Is Better and The Right Thing.

I'll claim that the view of economic evolution is what underlies the Worse Is Better vs The Right Thing opposition - and not the trade-off between design simplicity and other considerations as the essay states.

So the essay says one thing, and I'll show you it really says something else. Seriously, I will.

And then I'll tell you why it's important to me, and why - in Yegge's words - "this conceptual framework became one of the most important tools in my toolkit" (though of course each of us is talking about his own analogy).

Specifically, I came to think that you can be for evolution or against it, and I'm naturally inclined to be against it, and once I got that, I've been trying hard to not overdo it.

***

Much of the work on technology is done in a market context. I mean "market" in a relatively broad sense - not just proprietary for-profit developments, but situations of competition. Programs compete for users, specs compete for implementers, etc.

Markets and competition have a way to evoke strong and polar opinions in people. The technology market and technical people are no exception, including the most famous and highly regarded people. Here's what Linus Torvalds has to say about competition:

Don't underestimate the power of survival of the fittest. And don't ever make the mistake that you can design something better than what you get from ruthless massively parallel trial-and-error with a feedback cycle. That's giving your intelligence much too much credit.

And here's what Alan Kay has to say:

…if there’s a big idea and you have deadlines and you have expedience and you have competitors, very likely what you’ll do is take a low-pass filter on that idea and implement one part of it and miss what has to be done next. This happens over and over again.

Linus Torvalds thus views competition as a source of progress more important than anyone's ability to come up with bright ideas. Alan Kay, on the contrary, perceives market constraints as a stumbling block insurmountable for the brightest idea.

(The fact that Linux is vastly more successful than Smalltalk in "the market", whatever market one considers, is thus fully aligned with the creators' values.)

Incidentally, Linux was derived from Unix, and Smalltalk was greatly influenced by Lisp. At one point, Lisp and Unix - the cultures and the actual software - clashed in a battle for survival. The battle apparently followed a somewhat one-sided, Bambi meets Godzilla scenario: cheap Unix boxes quickly replaced sophisticated Lisp-based workstations, which became collectible items.

The aftermath is bitterly documented in The UNIX-HATERS Handbook, groundbreaking in its invention of satirical technical writing as a genre. The book's take on the role of evolution under market constraints is similar to Alan Kay's and the opposite of Linus Torvalds':

Literature avers that Unix succeeded because of its technical superiority. This is not true. Unix was evolutionarily superior to its competitors, but not technically superior. Unix became a commercial success because it was a virus. Its sole evolutionary advantage was its small size, simple design, and resulting portability.

The "Unix Haters" see evolutionary superiority as very different from technical superiority - and unlikely to coincide with it. The authors' disdain for the products of evolution isn't limited to development driven by economic factors, but extends to natural selection:

Once the human genome is fully mapped, we may discover that only a few percent of it actually describes functioning humans; the rest describes orangutans, new mutants, televangelists, and used computer sellers.

Contrast that to Linus' admiration of the human genome:

we humans have never been able to replicate  something more complicated than what we ourselves are, yet natural selection did it without even thinking.

The UNIX-HATERS Handbook presents in an appendix Richard P. Gabriel's famous essay, The Rise of Worse Is Better. The essay presents what it calls two opposing software philosophies. It gives them names - The Right Thing for the philosophy underlying Lisp, and Worse Is Better for the one behind Unix - names I believe to be perfectly fitting.

The essay also attempts to capture the key characteristics of these philosophies - but in my opinion, it focuses on non-inherent embodiments of these philosophies rather than their core. The essay claims it's about the degree of importance that different designers assign to simplicity. I claim that it's ultimately not about simplicity at all.

I thus claim that the essay discusses real things and gives them the right names, but the wrong definitions - a claim somewhat hard to defend. Here's my attempt to defend it.

Worse is Better - because it's simpler?

Richard Gabriel defines "Worse Is Better" as a design style focused on simplicity, at the expense of completeness, consistency and even correctness. "The Right Thing" is outlined as the exact opposite: completeness, consistency and correctness all trump simplicity.

First, "is it real"? Does a conflict between two philosophies really exist - and not just a conflict between Lisp and Unix? I think it does exist - that's why the essay strikes a chord with people who don't care much about Lisp or Unix. For example, Jeff Atwood

…was blown away by The Rise of "Worse is Better", because it touches on a theme I've noticed emerging in my blog entries: rejection of complexity, even when complexity is the more theoretically correct approach.

This comment acknowledges the conflict is real outside the original context. It also defines it as a conflict between simplicity and complexity, similarly to the essay's definition - and contrary to my claim that "it's not about simplicity".

But then examples are given, examples of "winners" at the Worse Is Better side - and suddenly x86 shows up:

The x86 architecture that you're probably reading this webpage on is widely regarded as total piece of crap. And it is. But it's a piece of crap honed to an incredibly sharp edge.

x86 implementations starting with the out-of-order implementations from the 90s are indeed "honed to an incredibly sharp edge". But x86 is never criticized because of its simplicity - quite the contrary, it's criticized precisely because an efficient implementation can not be simple. This is why the multi-billion-dollar "honing" is necessary in the first place.

Is x86 an example of simplicity? No.

Is it a winner at the Worse is Better side? A winner - definitely. At the "Worse is Better" side - yes, I think I can show that.

But not if Worse Is Better is understood as "simplicity trumps everything", as the original essay frames it.

Worse is Better - because it's more compatible?

Unlike Unix and C, the original examples of "Worse Is Better", x86 is not easy to implement efficiently - it is its competitors, RISC and VLIW, that are easy to implement efficiently.

But despite that, we feel that x86 is "just like Unix". Not because it's simple, but because it's the winner despite being the worse competitor. Because the cleaner RISC and VLIW ought to be The Right Thing in this one.

And because x86 is winning by betting on evolutionary pressures.

Bob Colwell, the Pentium Pro's chief architect, was a design engineer at Multiflow - an early VLIW company which was failing, prompting him to join Intel to create their out-of-order x86 implementation, the P6. In The Pentium Chronicles, he gives simplicity two thumbs up, acknowledges complexity as a disadvantage of x86 - and then explains why he bet on it anyway:

Throughout the 1980s, the RISC/CISC debate was boiling. RISC's general premise was that computer instruction sets … had become increasingly complicated and counterproductively large and arcane. In engineering, all other things being equal, simpler is always better, and sometimes much better.

…Some of my engineering friends thought I was either masochistic or irrational. Having just swum ashore from the sinking Multiflow ship, I immediately signed on to a "doomed" x86 design project. In their eyes, no matter how clever my design team was, we were inevitably going to be swept aside by superior technology. But … we could, in fact, import nearly all of RISC's technical advantages to a CISC design. The rest we could overcome with extra engineering, a somewhat larger die size, and the sheer economics of large product shipment volume. Although larger die sizes … imply higher production cost and higher power dissipation, in the early 1990s … easy cooling solutions were adequate. And although production costs were a factor of die size, they were much, much more dependent on volume being shipped, and in that arena, CISCs had an enormous advantage over their RISC challengers.

…because of having more users ready to buy them to run their existing software faster.

x86 is worse - as is quite clear now that, in cell phones and tablets, easy cooling solutions are not adequate and the RISC-based ARM wins big. But in the 1990s, because of compatibility issues, x86 was better.

Worse is Better, even if it isn't simpler - when The Right Thing is right technically, but not economically.

Worse is Better - because it's quicker?

Interestingly, Jamie Zawinski, who first spread the Worse is Better essay, followed a path somewhat similar to Colwell's. He "swum ashore" from Richard Gabriel's Lucid Inc., where he worked on what would become XEmacs, to join Netscape (named Mosaic at the time) and develop their very successful web browser. Here's what he said about the situation at Mosaic:

We were so focused on deadline it was like religion. We were shipping a finished product in six months or we were going to die trying. …we looked around the rest of the world and decided, if we're not done in six months, someone's going to beat us to it so we're going to be done in six months.

They didn't have to bootstrap the program on a small machine as in the Unix case. They didn't have to be compatible with an all-too-complicated previous version as in the x86 case. But they had to do it fast.

Yet another kind of economic constraint meaning that something else has to give. "We stripped features, definitely". And the resulting code was, according to jwz - not simple, but, plainly, not very good:

It's not so much that I was proud of the code; just that it was done. In a lot of ways the code wasn't very good because it was done very fast. But it got the job done. We shipped - that was the bottom line.

Worse code is Better than not shipping on time - Worse is Better in its plainest form. And nothing about simplicity.

Here's what jwz says about the Worse is Better essay - and, like Jeff Atwood, he gives a summary that doesn't summarize the actual text - but summarizes "what he feels it should have been":

…you should read it. It explains why mediocrity has better survival characteristics than perfection…

The essay doesn't explain that - the essay's text explains why simple-but-wrong has better survival characteristics than right-but-complex.

But as evidenced by jwz's and Atwood's comments, people want it to explain something else - something about perfection (The Right Thing) versus less than perfection (Worse is Better).

Worse is Better evolutionary

And it seems that invariably, what forces you to be less than perfection, what elects worse-than-perfect solutions, what "thinks" they're better, is economic, evolutionary constraints.

Economic constraints are what may happen to select for simplicity (Unix), compatibility (x86), development speed (Netscape) - or any other quality that might result in an otherwise worse product.

Just like Alan Kay said - but contrary to the belief of Linus Torvalds, the belief that ultimately, the result of evolution is actually better than anything that could have been achieved through design without the feedback of evolutionary pressure.

From this viewpoint, Worse Is Better ends up actually better than any real alternative - whereas from Alan Kay's viewpoint, Worse Is Better is actually worse than what's achievable.

(A bit convoluted, not? In fact, Richard Gabriel wrote several follow-ups, not being able to decide if Worse Is Better was actually better, or actually worse. I'm not trying to help decide that - just to show what makes one think it's actually better or worse.)

***

That's the first part - I hope to have shown that your view of evolution has a great effect on your design style.

If evolution is in the center of your worldview, if you think about viability as more important than perfection in any area, then you'll tend to design in a Worse Is Better style.

If you think of evolutionary pressure as an obstacle, an ultimately unimportant, harmful distraction on the road to perfection, then you'll prefer designs in The Right Thing style.

But why do people have a different view of evolution in the first place? Is there some more basic assumption underlying this difference? I think I have more to say about this, though it's not in nearly as finished form as the first part, and I might write about it all in the future.

Meanwhile, I want to conclude this first part with some thoughts on why it all matters personally to me.

I'm a perfectionist, by nature, and compromise is hard for me. Like many developers good enough to be able to implement much of their own ambitious ideas, I turned my professional life into a struggle for perfection. I wasn't completely devoid of common sense, but I did things that make me shiver today.

I wrote heuristic C++ parsers. I did 96 bit integer arithmetic in assembly. I implemented some perverted form of thread migration on the bare metal, without any kind of OS or thread support. I did many other things that I'm too ashamed to admit.

None of it was really needed, not if you asked me today. It was "needed" in the sense of being a step towards a too-good-for-my-own-good, "perfect" solution. Today I'd realize that this type of perfection is not viable anyway (in fact, none of these monstrosities survived in the long run.) I'd choose a completely different path that wouldn't require any such complications in the first place.

But my stuff shipped. I was able to make it work. You don't learn until you fail - at least I didn't. Perfectionists are stubborn.

Then at one point I failed. I had to throw out months worth of code, having realized that it's not going to fly.

And it so happened that I was reading Unix-Haters, and I was loving it, because I'm precisely the type of perfectionist that these people are, or close enough to identify with them. And there was this essay there about Worse Is Better vs The Right Thing.

And I was reading it when I wrote the code soon to be thrown out, and I was reading it when I decided to throw it out and afterwards.

And I suddenly started thinking, "This is not going to work, this type of thing. With this attitude, if you want it all, consistency, completeness, correctness - you'll get nothing, because you will fail, completely. You're too dumb, I mean I am, also not enough time. You have to choose, you're not going to get it all so you better decide what you want the most and aim at that."

If you read the Unix-Haters, you'll notice a lot of moral outrage - perfectionists have that, moral outrage at something imperfect. Especially at someone who knowingly chooses to aim at less than perfection. Especially if it's due to the ulterior motive of wanting to succeed.

And I felt a counter-outrage, for the first time. "What do you got to show, you got nothing. What good are your ideals if you end up dead? Dead bodies smell bad to us for a reason. Technical superiority without evolutionary superiority? Evolutionary inferiority means "dead". How can "dead" be technically superior? What have the dead ever done for us?"

It was huge, for me. I mean, it took a few years to truly sink in, but that was the start. I've never done anything Right since. And I've been professionally happy ever after. I guess it's a kind of "having swum ashore".

"It's done in hardware so it's cheap"

It isn't.

This is one of these things that look very obvious to me, to the point where it seems not worth discussing. However, I've heard the idea that "hardware magically makes things cheap" from several PhDs over the years. So apparently, if you aren't into hardware, it's not obvious at all.

So why doesn't "hardware support" automatically translate to "low cost"/"efficiency"? The short answer is, hardware is an electric circuit and you can't do magic with that, there are rules. So what are the rules? We know that hardware support does help at times. When does it, and when doesn't it?

To see the limitations of hardware support, let's first look at what hardware can do to speed things up. Roughly, you can really only do two things:

  1. Specialization - save dispatching costs in speed and energy.
  2. Parallelization - save time, but not energy, by throwing more hardware at the job.

Let's briefly look at examples of these two speed-up methods - and then some examples where hardware support does nothing for you, because none of the two methods helps. We'll only consider run time and energy per operation, ignoring silicon area (considering a third variable just makes it too hairy).

I'll also discuss the difference between real costs of operations and the price you pay for these operations, and argue that in the long run, costs are more stable and more important than prices.

Specialization: cheaper dispatching

If you want to extract bits 3 to 7 of a 32-bit word and then multiply them by 13 - let's say an encryption algorithm requires this - you can have an instruction doing just that. That will be faster and use less energy than, say, using bitwise AND, shift & multiplication instructions.
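
A rough C++ rendering of the general-purpose sequence; the specialized instruction is hypothetical - the point is that it would replace the whole shift/mask/multiply sequence, and its dispatching, with one step:

#include <cstdint>
#include <cstdio>

// General-purpose version: extract bits 3..7 of x and multiply by 13,
// using three generic operations (shift, mask, multiply).
uint32_t extract_and_scale_generic(uint32_t x) {
    uint32_t field = (x >> 3) & 0x1F;   // bits 3..7 - a 5-bit field
    return field * 13;
}

// With a specialized instruction, this would be a single opcode - something
// like a hypothetical __builtin_extract_mul13(x) - with the generic shifter,
// mask and multiplier, and their dispatching, gone from the critical path.

int main() {
    std::printf("%u\n", extract_and_scale_generic(0xFFu));  // bits 3..7 of 0xFF = 31 -> 403
}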

Why - what costs were cut out? The costs of dispatching individual operations - circuitry controlling which operation is executed, where the inputs come from and where the outputs go.

Specialization can be taken to an extreme. For instance, if you want a piece of hardware doing nothing but JPEG decoding, you can bring dispatching costs close to zero by having a single "instruction" - "decode a JPEG image". Then you have no flexibility - and none of the "overhead" circuitry found in more flexible machines (memory for storing instructions, logic for decoding these instructions, multiplexers choosing the registers that inputs come from based on these instructions, etc.)

Before moving on, let's look a little closer at why we won here:

  • We got a speed-up because the operations were fast to begin with - so dispatching costs dominated. With specialization, we need 5 wires connected directly to bits 3 to 7, with a tiny physical delay - just the time it takes the signal to travel to a nearby multiplier-by-13. Without specialization, we'd use a shifter shifting by a configurable amount of bits - 3 in our case but not always - which is a bunch of gates introducing a much larger delay. On top of that, since we'd be using several such circuits communicating through registers (let's say we're on a RISC CPU), we'd have delays due to reading and writing registers, delays due to selecting registers from a large register file, etc. With all this taken out by having a specialized instruction, no wonder we're seeing a big speed-up.
  • Likewise, we'll see lower energy consumption because the operations didn't require a lot of energy to begin with. Roughly, most of the energy is consumed when a signal value changes from 1 to 0 or back. When we use general-purpose instructions, most of the gate inputs & outputs and most flip-flops changing their values are those implementing the dispatching. When we use a specialized instruction, most of the switching is gone.

This means that, unsurprisingly, there's a limit to efficiency - the fundamental cost of the operations we need to do, which can't be cut.

When the operations themselves are costly enough - for instance, memory access or floating point operations - then their cost dominates the cost of dispatching. So specialized instructions that cut dispatching costs will give us little or nothing.

Parallelization: throwing more hardware at the job

What to do when specialization doesn't help? We can simply have N processors instead of one. For the parts that can be parallelized, we'll cut the run time by N - but spend the same amount of energy. So things got faster but not necessarily cheaper. A fixed power budget limits parallelization - as does a fixed budget of, well, money (the price of a 1000-CPU rack is still not trivial today).

[Why have multicore chips if it saves no energy? Because a multicore chip is cheaper than many single core ones, and because, above a certain frequency, many low-frequency cores use less energy than few high-frequency ones.]

We can combine parallelization with specialization - in fact it's done very frequently. Actually a JPEG decoder mentioned above would do that - a lot of its specialized circuits would execute in parallel.

Another example is how SIMD or SIMT processors broadcast a single instruction to multiple execution units. This way, we get only a speed-up, but no energy savings at the execution unit level: instead of one floating point ALU, we now have 4, or 32, etc. We do, however, get energy savings at the dispatching level - we save on program memory and decoding logic. As always with specialization, we pay in flexibility - we can't have our ALUs do different things at the same time, as some programs might want to.

Why do we see more single-precision floating point SIMD than double-precision SIMD? Because the higher the raw cost of operations, the less we save by specialization, and SIMD is a sort of specialization. If we have to pay for double-precision ALUs, why not put each in a full-blown CPU core? That way, at least we get the most flexibility, which means more opportunities to actually use the hardware rather than keeping it idle.

(It's really more complicated than that because SIMD can actually be a more useful programming model than multiple threads or processes in some cases, but we won't dwell on that.)

What can't be done

Now that we know what can be done - and there really isn't anything else - we basically already also know what can't be done. Let's look at some examples.

Precision costs are forever

8-bit integers are fundamentally more efficient than 32-bit floating point, and no hardware support for any sort of floating point operations can change this.

For one thing, multiplier circuit size (and energy consumption) is roughly quadratic in the size of inputs. IEEE 32b floating point numbers have 23b mantissas, so multiplying them means a ~9x larger circuit than an 8×8-bit multiplier with the same throughput. Another cost, linear in size, is that you need more memory, flip-flops and wires to store and transfer a float than an int8.

(People are more often aware of this one because SIMD instruction sets usually have fixed-sized registers which can be used to keep, say, 4 floats or 16 uint8s. However, this makes people underestimate the overhead of floating point as 4x - when it's more like 9x if you look at multiplying mantissas, not to mention handling exponents. Even int16 is 4x more costly to multiply than int8, not 2x as the storage space difference makes one guess.)

We design our own chips, and occasionally people say that it'd be nice to have a chip with, say, 256 floating point ALUs. This sounds economically nonsensical - sure it's nice and it's also quite obvious, so if nobody makes such chips at a budget similar to ours, it must be impossible, so why ask?

But actually it's a rather sensible suggestion, in that you can make a chip with 256 ALUs that is more efficient than anything on the market for what you do, but not flexible enough to be marketed as a general-purpose computer. That's precisely what specialization does.

However, specialization only helps with operations which are cheap enough to begin with compared to the cost of dispatching. So this can work with low-precision ALUs, but not with high-precision ALUs. With high-precision ALUs, the raw cost of operations would exceed our power budget, even if dispatching costs were zero.

Memory indirection costs are forever

I mentioned this in my old needlessly combative write-up about "high-level CPUs". There's this idea that we can have a machine that makes "high-level languages" run fast, and that they're really only slow because we're running on "C machines" as opposed to Lisp machines/Ruby machines/etc.

Leaving aside the question of what "high-level language" means (I really don't find it obvious at all, but never mind), object-orientation and dynamic typing frequently result in indirection: pointers instead of values and pointers to pointers instead of pointers. Sometimes it's done for no apparent reason - for instance, Erlang strings that are kept as linked lists of ints. (Why do people even like linked lists as "the" data structure and head/tail recursion as "the" control structure? But I digress.)

This kind of thing can never be sped up by specialization, because memory access fundamentally takes quite a lot of time and energy, and when you do p->a, you need one such access, and when you do p->q->a, you need two, hence you'll spend twice the time. Having a single "LOAD_LOAD" instruction instead of two - LOAD followed by a LOAD - does nothing for you.
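
In C++ terms (the types are made up for illustration):

#include <cstdio>

struct Inner { int a; };
struct Outer { Inner* q; };

// One memory access: load a through p.
int direct(Inner* p) { return p->a; }

// Two dependent memory accesses: load q through p, then load a through q.
// A fused "LOAD_LOAD" opcode would still have to wait for the first load
// to finish before it could even start the second.
int indirect(Outer* p) { return p->q->a; }

int main() {
    Inner in = {42};
    Outer out = {&in};
    std::printf("%d %d\n", direct(&in), indirect(&out));  // 42 42
}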

All you can do is parallelization - throw more hardware at the problem, N processors instead of one. You can, alternatively, combine parallelization with specialization, similarly to N-way floating point SIMD that's somewhat cheaper than having N full-blown processors. For example, you could have several load-store units and several cache banks and a multiple-issue processor. Then if you had to run p1->q1->a and somewhere near that, p2->q2->b, and the pointers point into different banks, some of the 4 LOADs would end up running in parallel, without having several processors.

But, similarly to low-precision math being cheaper whatever the merits of floating point SIMD, one memory access always costs half as much as two, despite the merits of cache banking and multiple issue. Specifically, doubling the memory access throughput roughly doubles the energy cost. This can sometimes be better than simply using two processors, but it's a non-trivial cost and will always be.

A note about latency

We could discuss other examples but these two are among the most popular - floating point support is a favorite among math geeks, and memory indirection support is a favorite among language geeks. So we'll move on to a general conclusion - but first, we should mention the difference between latency costs and throughput costs.

In our two examples, we only discussed throughput costs. A floating point ALU with a given throughput uses more energy than an int8 ALU. Two memory banks with a given throughput use about twice the energy of a single memory bank with half the throughput. This, together with the relatively high costs of these operations compared to the costs of dispatching them, made us conclude that there's nothing we can do.

In reality, the high latency of such heavyweight operations can be the bigger problem than our inability to increase their throughput without paying a high price in energy. For example, consider the instruction sequence:

c = FIRST(a,b)
e = SECOND(c,d)

If FIRST has a low latency, then we'll quickly proceed to SECOND. If FIRST has a high latency, then SECOND will have to wait for that amount of time, even if FIRST has excellent throughput. Say, if FIRST is a LOAD, being able to issue a LOAD every cycle doesn't help if SECOND depends on the result of that LOAD and the LOAD latency is 5 cycles.

A large part of computer architecture is various methods for dealing with these latencies - VLIW, out-of-order, barrel processors/SIMT, etc. These are all forms of parallelization - finding something to do in parallel with the high-latency instruction. A barrel processor helps when you have many threads. An out-of-order processor helps when you have nearby independent instructions in the same thread. And so on.

Just like having N processors, all these parallelization methods don't lower dispatching costs - in fact, they raise them (more registers, higher issue bandwidth, tricky dispatching logic, etc.) The processor doesn't become more energy efficient - you get more done per unit of time but not per unit of energy. A simple processor would be stuck at the FIRST instruction, while a more clever one would find something to do - and spend the energy to do it.

So latency is a very important problem with fundamentally heavyweight operations, and machinery for hiding this latency is extremely consequential for execution speed. But fighting latency using any of the available methods is just a special case of parallelization, and in this sense not fundamentally different from simply having many cores in terms of energy consumed.

The upshot is that parallelization, whether it's having many cores or having single-core latency-hiding circuitry, can help you with execution speed - throughput per cycle - but not with energy efficiency - throughput per watt.

The latency of heavyweight stuff is important and not hopeless; its throughput is important and hopeless.

Cost vs price

"But on my GPU, floating point operations are actually as fast as int8 operations! How about that?"

Well, a bus ticket can be cheaper than the price of getting to the same place in a taxi. The bus ticket will be cheaper even if you're the only passenger, in which case the real cost of getting from A to B in a bus is surely higher than the cost of getting from A to B in a taxi. Moreover, a bus might take you there more quickly if there are lanes reserved for buses that taxis are not allowed to use.

It's basically a cost vs price thing - math and physics vs economics and marketing. The fundamentals only say that a hardware vendor always can make int8 cheaper than float - but they can have good reasons not to. It's not that they made floats as cheap as int8 - actually, they made int8 as expensive as floats in terms of real costs.

Just like you going alone in a bus designed to carry dozens of people is an inefficient use of a bus, using float ALUs to process what could be int8 numbers is an inefficient use of float ALUs. Similarly, just like transport regulations can make lanes available for buses but not cars, an instruction set can make fetching a float easy but make fetching a single byte hard (no load byte/load byte with sign extension instructions). But cars could use those lanes - and loading bytes could be made easy.

As a passenger, of course you will use the bus and not the taxi, because economics and/or marketing and/or regulations made it the cheaper option in terms of price. Perhaps it's so because the bus is cheaper overall, with all the passengers it carries during rush hours. Perhaps it's so because the bus is a part of the contract with your employer - it's a bus carrying employees towards a nearby something. And perhaps it's so because the bus is subsidized by the government. Whatever the reason, you go ahead and use the cheaper bus.

Likewise, as a programmer, if you're handed a platform where floating point is not more expensive or even cheaper than int8, it is perhaps wise to use floating point everywhere. The only things to note are, the vendor could have given you better int8 performance; and, at some point, a platform might emerge that you want to target and where int8 is much more efficient than float.

The upshot is that it's possible to lower the price of floating point relative to int8, but not the cost.

What's more "important" - prices or costs?

Prices have many nice properties that real costs don't have. For instance, all prices can be compared - just convert them all to your currency of choice. Real costs are hard to compare without prices: is 2x less time for 3x more energy better or worse?

In any discussion about "fundamental real costs", there tend to be hidden assumptions about prices. For example, I chose to ignore area in this discussion under the assumption that area is usually less important than power. What makes this assumption true - or false - is the prices fabs charge for silicon production, the sort of cooling solutions that are marketable today (a desktop fan could be used to cool a cell phone but you couldn't sell that phone), etc. It's really hard to separate costs from prices.

Here's a computer architect's argument to the effect of "look at prices, not costs":

While technical metrics like performance, power, and programmer effort make up for nice fuzzy debates, it is pivotal for every computer guy to understand that “Dollar” is the one metric that rules them all. The other metrics are just sub-metrics derived from the dollar: Performance matters because that’s what customers pay for; power matters because it allows OEMs to put cheaper, smaller batteries and reduce people’s electricity bills; and programmer effort matters because it reduces the cost of making software.

I have two objections: that prices are the effect, not the cause, and that prices are too volatile to commit to memory as a "fundamental".

Prices are the effect in the sense that, customers pay for performance because it matters, not "performance matters because customers pay for it". Or, more precisely - customers pay for performance because it matters to them. As a result - because customers pay for it - performance matters to vendors. Ultimately, the first cause is that performance matters, not that it sells.

The other thing about prices is that they're rather jittery. Even a price index designed for stability such as S&P 500 is jumping up and down like crazy. In a changing world, knowledge about costs has a longer shelf life than knowledge about prices.

For instance, power is considered cheap for desktops but expensive for servers and really expensive for mobile devices. In reality, desktops likely consume more power than servers, there being more desktops than servers. So the real costs are not like the prices - and prices change; the rise of mobile computing means rising prices for power-hungry architectures.

It seems to me that, taking the long view, the following makes sense:

  • It's best to reason in costs and project them to the relevant prices - not forget the underlying costs and "think in prices", so as to not get into habits that will become outdated when prices change.
  • If you see a high real cost "hidden" by contemporary prices, it's a good bet to assume that at some point in the future, prices will shift so that the real cost will rear its ugly head.

For example, any RISC architecture - ARM, MIPS, PowerPC, etc. - is fundamentally cheaper than, specifically, x86, in at least two ways: hardware costs - area & power - and the costs of developing said hardware. [At least so I believe; let's say that it's not as significant in my view as my other, more basic examples - I might be wrong, and I'm only using this as an illustration.]

In the long run, this spells doom for the x86, whatever momentum it otherwise has at any point in time - software compatibility costs, Intel's manufacturing capabilities vs competitors capabilities, etc. Mathematically or physically fundamental costs will, in the long run, trump everything else.

In the long run, there is no x86, no ARM, no Windows, no iPhone, etc. There are just ideas. We remember ideas originating in ancient Greece and Rome, but no products. Every product is eventually outsold by another product. Old software is forgotten and old fabs rot. But fundamentals are forever. An idea that is sufficiently more costly fundamentally than a competing idea can not survive.

This is why I disagree with the following quote by Bob Colwell - the chief architect of the Pentium Pro (BTW, I love the interview and intend to publish a summary of the entire 160-something page document):

…you might say that CISC only stayed viable because Intel was able to throw a lot of money and people at it, and die size, bigger chips and so on.

In that sense, RISC still was better, which is what was claimed all along. And I said you know, there's point to be made there. I agree with you that Intel had more to do to stay competitive. They were starting a race from far behind the start line. But if you can throw money at a problem then, it's not really so fundamental technologically, is it? We look for more deep things than that, so if all the RISC/CISC thing amounted to was, you had a slight advantage economically, well, that's not as profound as it seemed back in the 80s was it?

Well, here's my counter-argument and it's not technical. The technical argument would be, CISC is worse, to the point where Intel's 32nm Medfield performs about as well as ARM-based 40nm chips in a space where power matters. Which can be countered with an economical argument - so what, Intel does have a better manufacturing ability so who cares, they still compete.

But my non-technical argument is, sure, you can be extremely savvy business-wise, and perhaps, if Intel realized early on how big mobile is going to be, they'd make a good enough x86-based offering back then and then everyone would have been locked out due to software compatibility issues and they'd reign like they reign in the desktop market.

But you can't do that forever. Every company is going to lose to some company at some point or other because you only need one big mistake and you'll make it, you'll ignore a single emerging market and that will be the end. Or, someone will outperform you technically - build a better fab, etc. If an idea is only ("only"?) being dragged into the future kicking and screaming by a very business-savvy and technically excellent company, then the idea has no chance.

The idea that will win is the idea that every new product will use. New products always beat old products - always have.

And nobody, nobody at all has made a new CISC architecture in ages. Intel will lose to a company or companies making RISC CPUs because nobody makes anything else - and it has to lose to someone. Right now it seems like it's ARM but it doesn't matter how it comes out in this round. It will happen at some point or other.

And if ARM beats x86, it won't be, straightforwardly, "because RISC is better" - x86 will have lost for business reasons, and it could have gone the other way for business reasons. But the fact that it will have lost to a RISC - that will be because RISC is technically better. That's why there's no CISC competitor to lose to.

Or, if you dismiss this with the sensible "in the long run, we're all dead" - then, well, if you're alive right now and you're designing hardware, you are not making a CISC processor, are you? QED, not?

Getting back to our subject - based on the assumption that real costs matter, I believe that ugly, specialized hardware is forever. It doesn't matter how much money is poured into general-purpose computing, by whom and why. You will always have sufficiently important tasks that can be accomplished 10x or 100x more cheaply by using fundamentally cheap operations, and it will pay off for someone to make the ugly hardware and write the ugly, low-level code doing low-precision arithmetic to make it work.

And, on the other hand, the market for general-purpose hardware is always going to be huge, in particular, because there are so many things that must be done where specialization fundamentally doesn't help at all.

Conclusion

Hardware can only deliver "efficiency miracles" for operations that are fundamentally cheap to begin with. This is done by lowering dispatching costs and so increasing throughput per unit of energy. The price paid is reduced flexibility.

Some operations, such as high-precision arithmetic and memory access, are fundamentally expensive in terms of energy consumed to reach a given throughput. With these, hardware can still give you more speed through parallelization, but at an energy cost that may be prohibitive.

Work on unimportant problems

"Work on important problems": ~40900 results.

"Work on unimportant problems": ~18 results.

– Google (at the time of writing), tempting the contrarian in me

It seems obvious that some problems are important to solve and some aren't, as in, curing cancer is more important than delivering social gaming. Often, people lament the abundance of tech firms working on ultimately unimportant stuff, and advise to work on important problems and not just chase the money.

I guess I agree that some problems are ultimately more important than others. But I don't think it follows that working on the important ones is better.

Working on unimportant problems can create important side-effects. A whole lot of mission-critical, world-changing and even life-saving tech is a by-product of "unimportant" things - time-wasting infotainment products, or personal pet projects started without a grand noble cause.

For instance, GPU hardware was developed to run first-person shooters with increasingly fancier graphics. Today, it powers some of the largest high-performance computing clusters where "important" science is done.

Other types of processors powering HPC clusters weren't designed for HPC, either. Hardware originally designed for scientific computing is dead - Cray is the iconic example - and replaced by cheaper and more powerful microprocessors designed to run things like office software. Office software arguably solves no important problem: as Berglas convincingly argues, office automation results not in increased productivity, but in increased complexity of rules and regulations.

All popular programming languages and operating systems, without a single exception I can think of, began either as personal projects or commercial projects not aiming to solve any problem "important" by itself. People hacked on the stuff for pleasure (C, Unix, Linux, Python, Ruby, PHP), or to conquer the world of businessy/officy/enterprisey software (Windows, VB, Java, C#, ASP). One language more specifically designed for the implementation of important software is Ada - but most important programs are written in something else.

It can even seem that not much important computer hardware or software came out of institutions directly dealing with important problems. The Internet protocols are one big exception, and arguably so is HTML - but probably not JavaScript.

And, certainly, it's the "unimportant" social companies that made publishing and coordination via Internet universally accessible. Myself, I'm not very fond of either Facebook, Twitter, etc. or the kind of political activity that's coordinated through these sites nowadays, but it's "important", without doubt - another important side-effect of unimportant time-wasting projects.

One might wonder how anything of importance can possibly come out of, say, FarmVille. I really don't know - however, I couldn't guess how anything of importance could come out of DOOM, and it did.

And then there's a reason why so much of the best tech comes out of the least "important" markets. These markets are big, and they're free. Important problems tend to imply a smallish scale, or heavy regulation, or both. So you can't finance the work, and/or can't get any work done anyway.

Consider the aerospace software market - there aren't many planes, but a whole lot of regulation. Philip Greenspun, a software entrepreneur, a flight instructor and an expert witness in both software-related and aviation-related lawsuits, had this to say about the Colgan 3407 disaster:

Who crashed Colgan 3407? Actually the autopilot did. … The airplane had all of the information necessary to prevent this crash. The airspeed was available in digital form. The power setting was available in digital form. The status of the landing gear was available in digital form. …

How come the autopilot software on this $27 million airplane wasn’t smart enough to fly basically sensible attitudes and airspeeds? Partly because FAA certification requirements make it prohibitively expensive to develop software or electronics that go into certified aircraft. It can literally cost $1 million to make a minor change. Sometimes the government protecting us from small risks exposes us to much bigger ones.

The same is happening in the automotive market, the healthcare market, etc. There's progress, of course, just nowhere near the progress in more frivolous areas - and much of the progress in "important" areas is a byproduct of progress in frivolous areas. As in, the best system for managing patients' records may well be Google Docs that doctors access from their iPads.

By the way, the importance of an issue correlates with the stupidity of rules, not just in technology, but in most things in life. The hoops you must jump through to get an "important" product out the door are not fundamentally different from airport security checks.

The airport security theater results in little added security. Likewise, the quality theater necessarily surrounding any life-saving technology results in little added quality. However, for much the same reasons, both are unavoidable. I've been working on automotive accident prevention systems for the last decade, and as time goes by, the regularly scheduled cavity searches are only getting worse.

So if you ask me - by all means, work on unimportant problems. They're often more fun to work on, and ultimately you never know how important they really are.

Hardware macroarchitecture vs microarchitecture

The comp-arch.net wiki defines "computer architecture" as the union of two things:

  • Macroarchitecture - the "visible" parts, the contract between hardware and software. For example, branch instructions.
  • Microarchitecture - the "invisible" parts, the implementation strategy which affects performance but not semantics. For example, branch prediction.

I think this distinction is very interesting for three reasons:

  1. It pops up everywhere. That is, many hardware problems can be addressed at the macro level or the micro level, explicitly or implicitly.
  2. The choice of macro vs micro is rarely trivial - for most problems, there are real-world examples of both kinds of solutions.
  3. The choice has common consequences across problems. The benefits and drawbacks of macro and micro are frequently similar.

I'll use examples from diverse types of hardware - CPUs, DSPs, GPUs, FPGAs, CAPPs, and even DRAM controllers. We'll discuss some example problems and how they can be solved at the macro or micro level. I'll leave the discussion of the resulting trade-offs to separate write-ups. Here, we'll go through examples to see what practical macro and micro solutions to different problems look like.

Our examples are:

  • Data parallelism: SIMD vs SIMT
  • Multiple issue: VLIW vs superscalar
  • Running ahead: exposed latencies vs OOO
  • Local storage: bare RAM vs cache
  • Streaming access: DMA vs hardware prefetchers
  • Data processing: logic synthesis vs instruction sets
  • Local communication: routing vs register addresses
  • Avoiding starvation: pressure signals vs request aging

Data parallelism: SIMD vs SIMT

Suppose you want to have a data-parallel machine: software issues one instruction that processes multiple data items.

The common macro approach is wide registers and SIMD opcodes. To use the feature, software must explicitly break up its data into 16-byte chunks, and use special opcodes like "add_16_bytes" to process the chunks.

One micro approach is what NVIDIA marketing calls SIMT. The instruction set remains scalar. However, hw runs multiple scalar threads at once, with simultaneously running threads all executing the same instruction. That way, 16 pairs of values are added in a single cycle using scalar instructions.

(If you're interested in SIMT, a detailed comparison with SIMD as well as SMT - the more general simultaneous multithreading model - is here.)
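
Here's a sketch of the two contracts - SSE intrinsics for the SIMD side, and a plain scalar function for the SIMT side, which real SIMT hardware would run as many lockstep threads. This illustrates the programming models, not any particular GPU:

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

// Macro (SIMD): software explicitly handles 16-byte chunks with special opcodes.
// Assumes n is a multiple of 16.
void add_simd(const uint8_t* a, const uint8_t* b, uint8_t* out, int n) {
    for (int i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(out + i), _mm_add_epi8(va, vb));
    }
}

// Micro (SIMT): the code stays scalar - one "thread" per element; the hardware
// is what runs 16 or 32 such threads at once, all at the same instruction.
void add_one_element(const uint8_t* a, const uint8_t* b, uint8_t* out, int i) {
    out[i] = a[i] + b[i];
}

int main() {
    uint8_t a[16] = {1}, b[16] = {2}, out[16] = {0};
    add_simd(a, b, out, 16);              // macro: one 16-wide add per chunk
    for (int i = 0; i < 16; ++i)
        add_one_element(a, b, out, i);    // micro: 16 scalar "threads"
    return out[0];                        // 3
}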

Multiple issue: VLIW vs superscalar/OOO

Suppose you want to have a multiple issue machine. You want to simultaneously issue multiple instructions from a single thread.

The macro approach is VLIW, which stands for "very long instruction word". The idea is, those multiple instructions you issue become "one (very long) instruction", because software explicitly asks to run them together: "ADD R0, R1, R2 and MUL R3, R0, R5". Note that ADD and MUL "see" the same value of R0: MUL gets R0's value before it's modified by ADD.

VLIW also lets software choose to say, "ADD R0, R1, R2; afterwards, MUL R3, R0, R5" - that's two separate instructions yielding vanilla serial execution. This is not only slower (2 cycles instead of 1), but has a different meaning. This way, MUL does see ADD's change to R0. Either way, you get what you explicitly asked for.
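
Modeling registers as plain C++ variables, here's a sketch of the different data flow the two encodings imply:

#include <cstdio>

int main() {
    // Serial encoding: "ADD R0,R1,R2; afterwards MUL R3,R0,R5" -
    // MUL sees ADD's result.
    {
        int r0 = 1, r1 = 2, r2 = 3, r5 = 10;
        r0 = r1 + r2;              // r0 = 5
        int r3 = r0 * r5;          // r3 = 50: MUL saw the new r0
        std::printf("serial: r3 = %d\n", r3);
    }
    // VLIW bundle: "ADD R0,R1,R2 and MUL R3,R0,R5" - both operations read
    // the register file before either writes, so MUL sees the *old* r0.
    {
        int r0 = 1, r1 = 2, r2 = 3, r5 = 10;
        int add_result = r1 + r2;  // both reads happen "at once"...
        int r3 = r0 * r5;          // ...so MUL uses r0 = 1, giving r3 = 10
        r0 = add_result;           // writes commit afterwards
        std::printf("bundle: r3 = %d, r0 = %d\n", r3, r0);
    }
}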

(If you're interested in VLIW, an explanation of how programs map to this rather strangely-looking architecture is here.)

The micro approach, called superscalar execution, is having the hardware analyze the instructions and run them in parallel - when that doesn't change the hw/sw contract (the serial semantics). For example, ADD R0, R1, R2 can run in parallel with MUL R3, R1, R2 - but not with MUL R3, R0, R5 where MUL's input, R0, depends on ADD. Software remains unaware of instruction-level parallelism - or at least it can remain unaware and still run correctly.

Running ahead: exposed latencies vs OOO

We've just discussed issuing multiple instructions simultaneously. A related topic is issuing instructions before a previous instruction completes. Here, the macro approach is to, well, simply go ahead and issue instructions. It's the duty of software to make sure those instructions don't depend on results that are not yet available.

For example, a LOAD instruction can have a 4-cycle latency. Then if you load to R0 from R1 and at the next cycle, add R0 and R2, you will have used the old value of R0. If you want the new value, you must explicitly wait for 4 cycles, hopefully issuing some other useful instructions in the meanwhile.

The micro approach to handling latency is called OOO (out-of-order execution). Suppose you load to R0, then add R0 and R2, and then multiply R3 and R4. An OOO processor will notice that the addition's input is not yet available, proceed to the multiplication because its inputs are ready, and execute the addition once R0 is loaded (in our example, after 4 cycles). The hw/sw contract is unaffected by the fact that hardware issues instructions before a previous instruction completes.

Local storage: local memories vs caches

Suppose you want to have some RAM local to your processor, so that much of the memory operations work with this fast RAM and not the external RAM, which is increasingly becoming a bottleneck.

The macro approach is, you just add local RAM. There can be special load/store opcodes to access this RAM, or a special address range mapped to it. Either way, when software wants to use local RAM, it must explicitly ask for it - as in, char* p = (char*)0x54000, which says, "I'll use a pointer pointing to this magic address, 0x54000, which is the base address of my local RAM".

This is done on many embedded DSPs and even CPUs - for example, ARM calls this "tightly-coupled memory" and MIPS calls this "scratch pad memory".

The micro approach is caches. Software doesn't ask to use a cache - it loads from an external address as if the cache didn't exist. It's up to hardware to:

  • Check if the data is already in the cache
  • If it isn't, load it to the cache
  • Decide which cached data will be overwritten with the new data
  • If the overwritten cached data was modified, write it back to external memory before "forgetting" it

The hardware changes greatly, the hw/sw contract does not.

Data streaming: DMA vs hardware prefetchers

Suppose you want to support efficient "streaming transfers". DRAM is actually a fairly poor random access memory - there's a big latency you pay per address. However, it has excellent throughput if you load a large contiguous chunk of data. To utilize this, a processor must issue loads without waiting for results of previous loads. Load, wait, load, wait… is slow; load, load, load, load… is fast.

The macro approach is, sw tells hw that it wants to load an array. For example, a DMA - direct memory access - engine can have control registers telling it the base address and the size of an array to load. Software explicitly programs these registers and says, "load".

DMA starts loading and eventually says, "done" - for example, by setting a bit. In the meanwhile, sw does some unrelated stuff until it needs the loaded data. At this point, sw waits until the "done" bit is set, and then uses the data.
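In code, the macro flow looks roughly like this - a sketch against a hypothetical memory-mapped DMA engine, with register addresses and bit layout invented for illustration:

  #include <stdint.h>

  #define DMA_SRC   (*(volatile uint32_t*)0x40000000u) /* base address of the array */
  #define DMA_SIZE  (*(volatile uint32_t*)0x40000004u) /* size in bytes             */
  #define DMA_START (*(volatile uint32_t*)0x40000008u) /* write 1 to start          */
  #define DMA_DONE  (*(volatile uint32_t*)0x4000000cu) /* bit 0 set when finished   */

  void stream_in(uint32_t array_base, uint32_t nbytes) {
      DMA_SRC   = array_base;
      DMA_SIZE  = nbytes;
      DMA_START = 1;                /* "load" - the engine streams from DRAM */

      /* ...do unrelated work while the transfer is in flight...              */

      while ((DMA_DONE & 1) == 0)   /* wait for the "done" bit                */
          ;
      /* ...now use the loaded data...                                        */
  }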

The micro approach is, software simply loads the array "as usual". Naturally, it loads from the base address, p, then from p+1, then p+2, p+3, etc. At some point, a hardware prefetcher quietly inspecting all the loads realizes that a contiguous array is being loaded. It then speculatively fetches ahead - loads large chunks beyond p+3 (hopefully not too large - we don't want to load too much unneeded data past the end of our array).

When software is about to ask for, say, p+7, its request is suddenly satisfied very quickly because the data is already in the cache. This keeps working nicely with p+8 and so on.
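The micro version of the same streaming load is just a plain loop - if a hardware prefetcher is present, it picks up the pattern on its own:

  long sum_array(const int* p, int n) {
      long sum = 0;
      for (int i = 0; i < n; i++)
          sum += p[i];   /* after a few misses, the next elements tend to be
                            in the cache already, courtesy of the prefetcher  */
      return sum;
  }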

Data processing: logic synthesis vs instruction sets

Let's get back to basics. Suppose we want to add a bunch of numbers. How does software tell hardware to add numbers?

The micro approach is so much more common that it's the only one that springs to mind. Why, of course hardware has an ADD command, and it's implemented in hardware by some sort of circuit. There are degrees here (should there be a DIV opcode, or should sw do division on its own?), but the upshot is, there are opcodes.

However, there are architectures where software explicitly constructs data processing operations out of bit-level boolean primitives. This is famously done on FPGAs and is called "logic synthesis" - effectively software gets to build its own circuits. (This programming model is so uncommon that it isn't even called "software", but it certainly is.)

Less famously, this is also what effectively happens on associative memory processors (CAPPs/APAs) - addition is implemented as a series of bit-level masked compare & write operations. (The CAPP way results in awfully long latencies, which you're supposed to make up with throughput by processing thousands of elements in parallel. If you're interested in CAPPs, an overview is available here.)

Of course, you can simulate multiplication using bitwise operations on conventional opcode-based processors. But that would leave much of the hardware unused. On FPGAs and CAPPs, on the other hand, "building things out of bits" is how you're supposed to utilize hardware resources. You get a big heap of exposed computational primitives, and you map your design to them.
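For a taste of what "building things out of bits" means, here is addition decomposed into boolean primitives - written in C, where it's merely a curiosity, whereas on an FPGA or a CAPP this kind of decomposition is the native vocabulary:

  /* carry bits come from AND, per-bit sums from XOR */
  unsigned add_from_bits(unsigned a, unsigned b) {
      while (b != 0) {
          unsigned carry = a & b;   /* positions where both bits are 1   */
          a = a ^ b;                /* sum of the bits, ignoring carries */
          b = carry << 1;           /* carries feed the next bit up      */
      }
      return a;
  }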

Local communication: routing vs register addresses

Another problem as basic as data processing operations is local communication: how does an operation pass its results to the next? We multiply and then add - how does the addition get the output of multiplication?

Again, the micro approach is by far the better known one. The idea is, you have registers, which are numbered somehow. We ask MUL to output to the register R5 (encoded as, say, 5). Then we ask ADD to use R5 as an input.

This actually doesn't sound very "micro" - what's implicit about it? We asked for R5 very explicitly. However, there are two sorts of "implicit" things going on here:

  • The numbers don't necessarily refer to physical registers - they don't on machines with register renaming.
  • More fundamentally, even when numbers do refer to physical registers, the routing is implicit.

How does the output of MUL travel to R5 and then to the input port of the adder? There are wires connecting these things, and multiplexers selecting between the various options. On most machines, there are also data forwarding mechanisms sending the output of MUL directly to the adder, in parallel to writing it into R5, so that ADD doesn't have to wait until R5 is written and then read back. But even on machines with explicit forwarding (and there aren't many), software doesn't see the wires and muxes - these are opaque hardware resources.

The macro approach to routing is what FPGAs do. The various computational units are connected to nearby configurable switches. By properly configuring those switches, you can send the output of one unit to another using a path going through several switches.

Of course this uses up the wires connecting the switches, and longer paths result in larger latencies. So it's not easy to efficiently connect all the units that you want using the switches and the wires that you have. In FPGAs, mapping operations to computational units and connecting between them is called "placement and routing". The "place & route" tools can run for a couple of hours given a large design.

This example, as well as the previous one, illustrates micro vs macro at its extreme - a hardware resource that looks "all-important" in one architecture is invisible in another, to the point where we forget it exists. The point is that the resource is equally important in both - the only question is who manages it, hardware or software.

Avoiding starvation: pressure signals vs request aging

One thing DRAM controllers do is accept requests from several different processors, put them in a queue, and reorder them. Reordering helps to better utilize DRAM, which, as previously mentioned, isn't that good at random access and prefers streaming access to consecutive locations.

So if two processors, A and B, run in parallel, and each accesses a different region, it's frequently better to group requests together - A, A, A, A, B, B, B, B - than to process them in the order in which they arrive - say, A, B, A, A, B, A, B, B.

In fact, as long as A keeps issuing requests, it's frequently better to keep processing them until they're over, and keep B waiting. Better, that is, for throughput, as well as for A's latency - but worse for B's latency. If we don't know when to stop, serving A and starving B could make the system unusable.

When to stop? One macro solution is, the DRAM controller has incoming pressure signals, and both A and B can complain when starved by raising the pressure. Actually, this is "macro" only as far as the DRAM controller is concerned - it gives outside components explicit control over its behavior. The extent of software control over the generation of the pressure signal depends on the processors A and B.

One micro solution is to use request aging. Older requests are automatically considered more urgent. This method is implemented in many DRAM controllers - for instance, Denali's. The macro approach is implemented in the Arteris DRAM scheduler.
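A toy model of the aging idea - not any real controller's policy, just the shape of it:

  typedef struct {
      int requester;       /* A, B, ...                                     */
      int age;             /* cycles the request has waited so far          */
      int hits_open_row;   /* 1 if it continues the current streaming burst */
  } request;

  /* Keep streaming from the open row for throughput, unless some request
     has waited past age_limit - then serve the oldest request instead.    */
  int pick_next(const request* q, int n, int age_limit) {
      int oldest = 0, streaming = -1;
      for (int i = 0; i < n; i++) {
          if (q[i].age > q[oldest].age)            oldest = i;
          if (q[i].hits_open_row && streaming < 0) streaming = i;
      }
      if (q[oldest].age >= age_limit)
          return oldest;                           /* aging prevents starvation */
      return streaming >= 0 ? streaming : oldest;  /* otherwise favor throughput */
  }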

The micro approach is safer - the controller itself takes care to prevent starvation, whereas with the macro option, a non-cooperative processor can starve others. It also uses a simpler bus protocol, making compatibility easier for processors. However, it can cost throughput - for instance, if B is a peripheral device with a large FIFO for incoming data, it could afford to wait quite a long time before the FIFO overflows, but request aging will make the controller serve B much earlier than that, interrupting A's streaming.

Whatever the benefits and drawbacks - and here, we aren't going to discuss benefits and drawbacks in any depth - this last example is supposed to illustrate that macro vs micro is relevant beyond "core"/"processor" design, extending to "non-computational" hardware as well.

Blurred boundaries

Micro vs macro is more of a continuum than a strictly binary distinction. That is, we can't always label a hardware feature as "visible" or "invisible" to programmers - rather, we can talk about the extent of its visibility.

There are basically two cases of "boundary blurring":

  • Hardware features "quite visible" even though they don't affect program semantics. These are "technically micro" but "macro in spirit".
  • Hardware features "quite invisible" even though they do affect program semantics. These are "technically macro" but "micro in spirit".

Let's briefly look at examples of both kinds of "blurring".

Technically micro but macro in spirit

A good example is memory banking. The point of banking is increasing the number of addresses that can be accessed per cycle. A single 32K memory bank lets you access a single address per cycle. 2 16K banks let you access 2 addresses, 4 8K banks let you access 4 addresses, and so on.

So basically "more is better". What limits the number of banks is the overhead you pay per bank, the overhead of logic figuring out the bank an address belongs to, and the fact that there's no point in accessing more data than you can process.

Now if we look at banking as implemented in NVIDIA GPU memory, TI DSP caches and superscalar CPU caches, then at first glance, they're all "micro solutions". These machines seem to mostly differ in their mapping of address to bank - for instance, NVIDIA GPUs switch banks every 4 bytes, while TI DSPs switch banks every few kilobytes.
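In code, the two kinds of mapping look roughly like this (the constants are illustrative, not any particular device's):

  #include <stdint.h>

  #define NUM_BANKS 16

  /* GPU-style: switch banks every 4 bytes, so neighboring words land in
     different banks and consecutive accesses don't collide.              */
  static int bank_every_4_bytes(uint32_t addr)  { return (addr / 4) % NUM_BANKS; }

  /* DSP-cache-style: switch banks every few kilobytes - here, every 8K.  */
  static int bank_every_8_kbytes(uint32_t addr) { return (addr / 8192) % NUM_BANKS; }

With the first mapping, for example, a stride of NUM_BANKS*4 bytes puts every access into the same bank - the classic contention case.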

But on all these machines, software can remain unaware of banking and run correctly. If two addresses that map to the same bank are accessed at the same cycle, the access will take two cycles instead of one - but no fault is reported and results aren't affected. Semantically, banking is invisible.

However, I'd call GPUs' and DSPs' banking "macroish", and superscalar CPUs' banking "microish". Why?

GPUs and DSPs "advertise" banking, and commit to a consistent address mapping scheme and consistent performance implications across different devices. Vendors encourage you to know about banking so that you allocate data in ways minimizing contentions.

CPUs don't advertise banking very much, and different CPUs running the same instruction set have different banking schemes which result in different performance. Moreover, those CPU variants differ in their ability to access multiple addresses in parallel in the first place: a low-end CPU might access at most one address but a high-end CPU might access two.

GPUs and DSPs, on the other hand, have explicit multiple load-store units (a macro feature). So software knows when it attempts to access many addresses in parallel - one reason to "advertise" which addresses can actually be accessed in parallel.

This shows why hardware features that don't affect program semantics aren't "completely invisible to programmers" - rather, there are "degrees of visibility". A feature only affecting performance is "quite visible" if vendors and users consider it an important part of the hw/sw contract.

Technically macro but micro in spirit

SIMD and VLIW are both visible in assembly programs/binary instruction streams. However, SIMD is "much more macro in spirit" than VLIW. That's because for many programmers, the hw/sw contract isn't the semantics of assembly, which they never touch, but the semantics of their source language.

At the source code level, the effect of SIMD tends to be very visible. Automatic vectorization rarely works, so you end up using intrinsic functions and short vector data types. The effect of VLIW on source code can be close to zero. Compilers are great at automatic scheduling, and better than humans, so there's no reason to litter the code with any sort of annotations to help them. Hence, SIMD is "more macro" - more visible.
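To make the contrast concrete, here's how SIMD typically shows up in the source - an SSE-flavored sketch (x86-specific; other targets have different intrinsics and vector types). The VLIW version of the same computation would be just the plain scalar loop, left for the compiler to schedule:

  #include <xmmintrin.h>   /* SSE intrinsics and the __m128 short-vector type */

  void add_arrays(float* c, const float* a, const float* b, int n) {
      int i = 0;
      for (; i + 4 <= n; i += 4) {
          __m128 va = _mm_loadu_ps(a + i);           /* explicit 4-wide load */
          __m128 vb = _mm_loadu_ps(b + i);
          _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* explicit 4-wide add  */
      }
      for (; i < n; i++)                             /* scalar leftovers     */
          c[i] = a[i] + b[i];
  }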

Moreover, there's "macroish VLIW" and "microish VLIW" - just like there's "macroish banking" and "microish banking" - and, again, the difference isn't in the hardware feature itself, but in the way it's treated by vendors and users.

An extreme example of "microish VLIW" is Transmeta - the native binary instruction encoding was VLIW, but the only software that was supposed to be aware of that were the vendor-supplied binary translators from x86 or other bytecode formats. VLIW was visible at the hardware level but still completely hidden from programmers by software tools.

An opposite, "macro-centric" example is TI's C6000 family. There's not one, but two "human-writable assembly languages". There's parallel assembly, where you get to manually schedule instructions. There's also linear assembly, which schedules instructions for you, but you still get to explicitly say which execution unit each instruction will use (well, almost; let's ignore the A-side/B-side issues here.)

Why provide such a "linear assembly" language? Josh Fisher, the inventor of VLIW, didn't approve of the concept in his book "Embedded Computing: a VLIW Approach".

That's because originally, one of the supposed benefits of VLIW was precisely being "micro in spirit" - the ability to hide VLIW behind an optimizing compiler meant that you could speed up existing code just by recompiling it. Not as easy as simply running old binaries on a stronger new out-of-order processor, but easy enough in many cases - and much easier to support at the hardware end.

Linear assembly basically throws these benefits out the window. You spell things in terms of C6000's execution units and opcodes, so the code can't be cross-platform. Worse, TI can't decide to add or remove execution units or registers from some of the C6000 variants and let the compiler reschedule instructions to fit the new variant. Linear assembly refers to units and registers explicitly enough to not support this flexibility - for instance, there's no silent spill code generation. Remove some of the resources, and much of the code will stop compiling.

Then why is linear assembly shipped by TI, and often recommended as the best source language for optimized code? The reason is that the code is more "readable" - if one of the things the reader is after is performance implications. The same silent spill code generation that makes C more portable makes it "less readable", performance-wise - you can never tell whether your data fits into registers or not; similarly, it's hard to tell how many operations of each execution unit are used.

The beauty of linear assembly is that it hides the stuff humans particularly hate to do and compilers excel at - such as register allocation and instruction scheduling - but it doesn't hide things making it easy to estimate performance - such as instruction selection and the distinction between stack and register variables. (In my opinion, the only problem with linear assembly is that it still hides a bit too much - and that people can choose to not use it. They often do - and preserve stunning unawareness of how the C6000 works for years and years.)

Personally, I believe that, contrary to original expectations, VLIW works better in "macro-centric" platforms than in "micro-centric" ones - a viewpoint consistent with how Transmeta's chips fared relative to VLIW DSPs. Whether this view is right or wrong, the point here is that hardware features "visible" to software can be more or less visible to programmers - depending on how visible the software stack in its entirety makes them.

Implications

We've seen that "macro vs micro" is a trade-off appearing in a great many contexts in hardware design, and that typically, both types of solutions can be found in practical architectures - so it's not clear which is "better".

If there's no clear winner, what are the benefits and drawbacks of these two options? I believe that these benefits and drawbacks are similar across the many contexts where the trade-off occurs. Some of the implications were briefly mentioned in the discussion on VLIW's "extent of visibility" - roughly:

  • Micro is more compatible
  • Macro is more efficient
  • Macro is easier to implement

There are other common implications - for example, macro is harder to context-switch (I like this one because, while it's not very surprising once you think about it, it doesn't immediately spring to mind).

I plan to discuss the implications in detail sometime soon. I intend to focus not so much on how things could be in theory as on how they actually tend to come out, and why.