
Consistency: how to defeat the purpose of IEEE floating point

I don’t know much about the design of IEEE floating point, except for the fact that a lot of knowledge and what they call “intellectual effort” went into it. I don’t even know the requirements, and I suspect those were pretty detailed and complex (for example, the benefits of having a separate representation for +0 and -0 seem hard to grasp unless you know about the very specific and hairy examples in the complex plane). So I don’t trust my own summary of the requirements very much. That said, here’s the summary: the basic purpose of IEEE floating point is to give you results of the highest practically possible precision at each step of your computation.

I’m not going to claim this requirement is misguided, because I don’t feel like arguing with people two orders of magnitude more competent than myself who have likely faced much tougher numerical problems than I’ve ever seen. What I will claim is that differences in numerical needs divide programmers into roughly three camps, and the highest-possible-precision approach hurts one of them really badly, and so has to be worked around in ways we’ll discuss. The camps are:

  1. The huge camp of people who do businessy accounting. Those should work with integral types to get complete, deterministic, portable control over rounding and all that. Many of the clueless people in this camp represent 1 dollar and 10 cents as the floating point number 1.1. While they are likely a major driving force behind economic growth, I still think they deserve all the trouble they bring upon themselves.
  2. The tiny camp doing high-end scientific computing. Those are the people who can really appreciate the design of IEEE floating point and use its full power. It’s great that humanity accidentally satisfied the needs of this small but really cool group, making great floating point hardware available everywhere through blind market forces. It’s like having a built-in Stradivari in each home appliance. Yes, perhaps I exaggerate; I get that a lot.
  3. The sizable camp that deals with low-end to mid-range semi-scientific computing. You know, programs that have some geometry or physics or algebra in them. 99.99% of the code snippets in that realm work great with 64b floating point, without the author having invested any thought at all into “numerical analysis”. 99% of the code works with 32b floats. When someone stumbles upon a piece of code in the 1% and actually witnesses fatal precision loss, everybody gathers to have a look as if they heard about a beautiful rainbow or a smoke suggesting a forest fire near the horizon.
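The camp 1 advice (integral types instead of 1.1) can be sketched in a few lines of C. This is only an illustration, not production accounting code; the type name cents_t and both helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: money held as a whole number of cents. */
typedef int64_t cents_t;

/* Build an amount from dollars and cents, e.g. (1, 10) for $1.10. */
cents_t money(int64_t dollars, int cents) {
    assert(cents >= 0 && cents < 100);
    return dollars * 100 + cents;
}

/* Add up n copies of the same price; integer addition is exact. */
cents_t total_price(cents_t unit_price, int n) {
    cents_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += unit_price;
    return sum;
}
```

Ten items at 10 cents come to exactly 100 cents here on every platform and in every build mode. Adding the double 0.1 ten times carries no such guarantee, because 0.1 has no exact binary representation.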

The majority of people who use and actually should be using floating point are thus in camp 3. Those people don’t care about precision anywhere near camp 2, nor do they know how to make the best of IEEE floating point in the very unlikely circumstances where their naive approach will actually fail to work. What they do care about though is consistency. It’s important that things compute the same on all platforms. Perhaps more importantly for most, they should compute the same under different build settings, most notably debug and release mode, because otherwise you can’t reproduce problems.

Side note: I don’t believe in build modes; I usually debug production code in release mode. It’s not just floating point that’s inconsistent across modes - it’s code snippets with behavior undefined by the language, buggy dependence on timing, optimizer bugs, conditional compilation, etc. Many other cans of worms. But the fact is that most people have trouble debugging optimized code, and nobody likes it, so it’s nice to have the option to debug in debug mode, and to do that, you need things to reproduce there.

Also, comparing results of different build modes is one way to find worms from those other cans, like undefined behavior and optimizer bugs. Also, many changes you make are optimizations or refactorings and you can check their sanity by making sure they didn’t change the results of the previous version. As we’ll see, IEEE FP won’t give you even that, regardless of platforms and build modes. The bottom line is that if you’re in camp 3, you want consistency, while the “only” things you can expect from IEEE FP are precision and speed. Sure, “only” should be put in quotes because it’s a lot to get, it’s just a real pity that fulfilling the smaller and more popular wish for consistency is somewhere between hard and impossible.

Some numerical analysts seem annoyed by the camp 3 whiners. To them I say: look, if IEEE FP wasn’t the huge success that it is in the precision and speed departments, you wouldn’t be hearing from us because we’d be busy struggling with those problems. What we’re saying is the exact opposite of “IEEE FP sucks”. It’s “IEEE FP is so damn precise and fast that I’m happy with ALL of its many answers - the one in optimized x86 build, the one in debug PowerPC build, the one before I added a couple of local variables to that function and the one I got after that change. I just wish I consistently got ONE of these answers, any of them, but the same one.” I think it’s more flattering than insulting.

I’ve accumulated quite some experience in defeating the purpose of IEEE floating point and getting consistency at the (tiny, IMO) cost of precision and speed. I want to share this knowledge with humanity, with the hope of getting rewarded in the comments. The reward I’m after is a refutation of my current theory that you can only eliminate 95%-99% of the pain automatically and have to solve the rest manually each time it raises its ugly head.

The pain breakdown

I know three main sources of floating point inconsistency pain:

  1. Algebraic compiler optimizations
  2. “Complex” instructions like multiply-accumulate or sine
  3. x86-specific pain not available on any other platform; not that ~100% of non-embedded devices is a small market share for a pain.

The good news is that most pain comes from item 3 which can be more or less solved automatically. For the purpose of decision making (“should we invest energy into FP consistency or is it futile?”), I’d say that it’s not futile and if you can cite actual benefits you’d get from consistency, then it’s worth the (continuous) effort.

Disclaimer: I only discuss problems I know and to the extent of my understanding. I have no solid evidence that this understanding is complete. Perhaps the pain breakdown list should have item 4, and perhaps items 1 to 3 have more meat than I think. And while I tried to get the legal stuff right - which behavior conforms to IEEE 754, which conforms to C99, and which conforms to nothing but is still out there - I’m generally a weak tech lawyer and can be wrong. I can only give you the “worked on my 4 families of machines” kind of warranty.

Algebraic compiler optimizations

Compilers, or more specifically buggy optimization passes, assume that floating point numbers can be treated as a field - you know, associativity, distributivity, the works. This means that a+b+c can be computed in either the order implied by (a+b)+c or the one implied by a+(b+c). Adding actual parentheses in source code doesn’t help you one bit. The compiler assumes associativity and may implement the computation in the order implied by regrouping your parentheses. Since each floating point operation loses precision on some of the possible inputs, the code generated by different optimizers or under different optimization settings may produce different results.
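A small sketch of why regrouping matters, assuming a compiler that has not been told to reassociate (no -ffast-math or similar). The function names and the particular triple of inputs are chosen here for illustration only.

```c
#include <assert.h>

/* Left-to-right grouping: (a + b) + c. */
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

/* Right-to-left grouping: a + (b + c). */
double sum_right(double a, double b, double c) {
    return a + (b + c);
}

/* One concrete triple where the two groupings disagree. */
void check_grouping_example(void) {
    /* 1e20 - 1e20 is exactly 0, and 0 + 1 is 1. */
    assert(sum_left(1e20, -1e20, 1.0) == 1.0);
    /* -1e20 + 1 rounds back to -1e20, so the total is 0. */
    assert(sum_right(1e20, -1e20, 1.0) == 0.0);
}
```

The inputs are deliberately extreme so that the difference is a whole unit; in typical camp 3 code the two groupings differ only in the last bits, which is still enough to break a debug==release comparison.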

This could be extremely intimidating because you can’t trust any FP expression with more than 2 inputs. However, I think that programming languages in general don’t allow optimizers to make these assumptions, and in particular, the C standard doesn’t (C99 §5.1.2.3 #13, didn’t read it in the document but saw it cited in two sources). So this sort of optimization is actually a bug that will likely be fixed if reported, and at any given time, the number of these bugs in a particular compiler should be small.

I only recall one recurring algebraic optimization problem. Specifically, a*(b+1) tended to be compiled to a*b+a in release mode. The reason is that floating-point literal values like 1 are expensive; 1 becomes a hairy hexadecimal value that has to be loaded from a constant address at run time. So the optimizer was probably happy to optimize a literal away. I was always able to solve this problem by changing the spelling in the source code to a*b+a - the optimizer simply had less work to do, while the debug build saw no reason to make me miserable by undoing my regrouping back into a*(b+1).
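To see that the two spellings are not interchangeable, here is a minimal sketch with one hand-picked input pair. It assumes double arithmetic is rounded to double at every step (for example SSE2 code, where FLT_EVAL_METHOD is 0); on an x87 build with 80b intermediates the asserts may not hold. The function names are invented for the example.

```c
#include <assert.h>

/* The spelling from the source code: a * (b + 1). */
double scaled_as_written(double a, double b) {
    return a * (b + 1.0);
}

/* The spelling the optimizer reportedly preferred: a * b + a. */
double scaled_as_rewritten(double a, double b) {
    return a * b + a;
}

void check_rewrite_example(void) {
    double a = 3.0, b = 0x1p-53; /* b is 2^-53, half an ulp of 1.0 */
    /* b + 1 rounds to 1.0 (a tie, rounded to even), so the product is 3. */
    assert(scaled_as_written(a, b) == 3.0);
    /* a*b is exact; adding it to 3 rounds up by one ulp of 3. */
    assert(scaled_as_rewritten(a, b) == 3.0 + 0x1p-51);
}
```

The two results differ by a single unit in the last place, which is the typical size of the discrepancy this kind of rewrite introduces.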

This implies a general way of solving this sort of problem: find what the optimizer does by looking at the generated assembly, and do it yourself in the source code. This almost certainly guarantees that debug and release will work the same. With different compilers and platforms, the guarantee is less certain. The second optimizer may think that the optimization you copied from the first optimizer into your source code is brain-dead, and undo it and do a different optimization. However, that means you target two radically different optimizers, both of which are buggy and can’t be fixed in the near future; how unlucky can you get?

The bottom line is that you rarely have to deal with this problem, and when it can’t be solved with a bug report, you can look at the assembly and do the optimization in the source code yourself. If that fails because you have to use two very different and buggy compilers, use the shotgun described in the next item.

“Complex” instructions

Your target hardware can have instructions computing “non-trivial” expressions beyond a*b or a+b, such as a+=b*c or sin(x). The precision of the intermediate result b*c in a+=b*c may be higher than the size of an FP register would allow, had that result been actually stored in a register. IEEE and the C standard think it’s great, because the single instruction generated from a+=b*c is both faster and more precise than the 2 instructions implementing it as d=b*c, a=a+d. Camp 3 people like myself don’t think it’s so great, because it happens in some build modes but not others, and across platforms the availability of these instructions varies, as does their precision.

AFAIK the “contraction” of a+=b*c is permitted by both the IEEE FP standard (which defines FP + and *) and the C standard (which defines FP types that can map to standards other than IEEE). On the other hand, sin(x), which also gets implemented in hardware these days, isn’t addressed by either standard - to the same effect of making the optimization perfectly legitimate. So you can’t solve this by reporting a bug the way you could with algebraic optimizations. The other way in which this is tougher is that tweaking the code according to the optimizer’s wishes doesn’t help much. AFAIK what can help is one of these two things:

  1. Ask the compiler to not generate these instructions. Sometimes there’s an exact compiler option for that, like gcc’s platform-specific -mno-fused-madd flag, or there’s (a defined and actually implemented) pragma directive such as #pragma STDC FP_CONTRACT. Sometimes you don’t have any of that, but you can lie to the compiler that you’re using an older (compatible) revision of the processor architecture without the “complex” instructions. The latter is an all-or-nothing thing enabling/disabling lots of stuff, so it can degrade performance in many different ways; you have to check the cost.
  2. If compiler flags can’t help, there’s the shotgun approach I promised to discuss above, also useful for hypothetical cases of hard-to-work-around algebraic optimizations. Instead of helping the optimizer, we get in its way and make optimization impossible using separate compilation. For example, we can convert a+=b*c to a+=multiply_dammit(b,c); multiply_dammit is defined in a separate file. This makes it impossible for most optimizers to see the expression a+=b*c, and forces them to implement multiplication and addition separately. Modern compilers support link-time inlining and then they do optimize the result as a whole. But you can disable that option, and as a side effect speed up linkage a great deal; if that seriously hurts performance, your program is insane and you’re a team of scary ravioli coders.
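The shotgun approach from item 2 can be approximated in a single file. The sketch below keeps the multiply_dammit name from the text, but swaps separate compilation for a volatile store, purely so that the example is self-contained; that substitution and the madd_unfused name are assumptions of this sketch, not the method described above.

```c
#include <assert.h>

/* Stand-in for the separately compiled multiply_dammit from the text.
   Storing the product through a volatile forces it to be rounded to a
   double on its own, so it cannot be fused with a later addition. */
double multiply_dammit(double b, double c) {
    volatile double product = b * c;
    return product;
}

/* a += b*c with the product and the sum rounded separately. */
double madd_unfused(double a, double b, double c) {
    return a + multiply_dammit(b, c);
}

void check_unfused_example(void) {
    double b = 1.0 + 0x1p-27, c = 1.0 - 0x1p-27;
    /* The exact product is 1 - 2^-54, which rounds to 1.0 as a double,
       so the unfused result is 0. A fused multiply-add would keep the
       exact product and return -2^-54 instead. */
    assert(madd_unfused(-1.0, b, c) == 0.0);
}
```

Like the real thing, this costs a store and a load per multiplication, so it belongs only in code that is not performance-critical.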

The trouble with the shotgun approach, aside from its ugliness, is that you can’t afford to shoot at the performance-critical parts of your code that way. Let us hope that you’ll never really have to choose between FP consistency and performance, as I’ve never had to date.

x86

Intel is the birthplace of IEEE floating point, and the manufacturer of the most camp-3-painful and otherwise convoluted FP hardware. The pain comes, somewhat understandably, from a unique commitment to the IEEE FP philosophy - intermediate results should be as precise as possible; more on that in a moment. The “convoluted” part is consistent with the general insanity of the x86 instruction set. Specifically, the “old” (a.k.a. “x87”) floating point unit uses a stack architecture for addressing FP operands, which is pretty much the exact opposite of the compiler writer’s dream target, but so is the rest of x86. The “new” floating point instructions in SSE don’t have these problems, at the cost of creating the aesthetic/psychiatric problem of actually having two FP ISAs in the same processor.

Now, in our context we don’t care about the FP stack thingie and all that, the only thing that matters is the consistency of precision. The “old” FP unit handles precision thusly. Precision of stuff written to memory is according to the number of bits of the variable, ’cause what else can it be. Precision of intermediate results in the “registers” (or the “FP stack” or whatever you call it) is defined according to the FPU control & status register, globally for all intermediate results in your program.

By default, it’s 80 bits. This means that when you compute a*b+c*d and a,b,c,d are 32b floats, a*b and c*d are computed in 80b precision, and then their 80b sum is converted to a 32b result in memory (if a*b+c*d is indeed written to memory and isn’t itself an “intermediate” result). Indeed, what’s “intermediate” in the sense of not being written to memory and what isn’t? That depends on:

  1. Debug/release build. If we have “float e=a*b+c*d”, and e is only used once right in the next line, the optimizer probably won’t see a point in writing it to memory. However, in a debug build there’s a good reason to allocate it in memory, because if you single-step your program and you’re already past the line that used e, you still might want to look at the value of e, so it’s good that the compiler kept a copy of it for the debugger to find.
  2. The code “near” e=a*b+c*d according to the compiler’s notion of proximity also affects its decisions. There are only so many registers, and sometimes you run out of them and have to store things in memory. This means that if you add or remove code near the line or in inline functions called near the line, the allocation of intermediate results may change.
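C99 gives a build a way to report whether its intermediates may carry extra precision: the FLT_EVAL_METHOD macro in <float.h>. The macro is not mentioned in the text above; the sketch below is one way to use it, and both function names are invented here.

```c
#include <assert.h>
#include <float.h>

/* Describe how this build evaluates float and double expressions,
   based on the C99 FLT_EVAL_METHOD macro from <float.h>. */
const char *describe_fp_evaluation(void) {
    switch (FLT_EVAL_METHOD) {
    case 0:  return "each operation is rounded to its own type";
    case 1:  return "float operations are evaluated as double";
    case 2:  return "float and double are evaluated as long double";
    default: return "indeterminable or implementation-defined";
    }
}

/* Fail loudly if intermediates may be wider than the operand types,
   which is the situation described for the old x87 unit. */
void require_exact_width_intermediates(void) {
    assert(FLT_EVAL_METHOD == 0);
}
```

Calling require_exact_width_intermediates() early in a program is a cheap way to learn that a particular build mode or platform has wide intermediates before spending an afternoon on a debug/release mismatch.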

Compilers could have an option asking them to hide this mess and give us consistent results. The problems with this are that (1) if you care about cross-platform/compiler consistency, then the availability of cross-mode consistency options in one compiler doesn’t help with the other compiler and (2) for some reason, compilers apparently don’t offer this option in a very good way. For example, MS C++ used to have a /fltconsistency switch but seems to have abandoned it in favor of an insane special-casing of the syntax float(a*b)+float(c*d) - that spelling forces consistency (although the C++ standard doesn’t assign it a special meaning not included in the plain and sane a*b+c*d).

I’d guess they changed it because of the speed penalty it implies rather than the precision penalty as they say. I haven’t heard about someone caring both about consistency and that level of precision, but I did hear that gcc’s consistency-forcing -ffloat-store flag caused notable slowdowns. And the reason it did is implied by its name - AFAIK the only way to implement x86 FP consistency at compile time is to generate code storing FP values to memory to get rid of the extra precision bits. And -ffloat-store only affects named variables, not unnamed intermediate results (annoying, isn’t it?), so /fltconsistency, assuming it actually gave you consistency of all results, should have been much slower. Anyway, the bottom line seems to be that you can’t get much help from compilers here; deal with it yourself. Even Java gave up on its initial intent of getting consistent results on the x87 FPU and retreated to a cowardly strictfp scheme.

And the thing is, you never have to deal with it outside of x86 - all floating point units I’ve come across, including the ones specified by Intel’s SSE and SSE2, simply compute 32b results from 32b inputs. People who decided to do it that way and rob us of quite some bits of precision have my deepest gratitude, because there’s absolutely no good way to work around the generosity of the original x86 FPU designers and get consistent results. Here’s what you can do:

  1. Leave the FP CSR configured to 80b precision. 32b and 64b intermediate results aren’t really 32b and 64b. The fact that it’s the default means that if you care about FP result consistency, intensive hair pulling during your first debugging sessions is an almost inevitable rite of passage.
  2. Set the FP CSR to 64b precision. If you only use 64b variables, debug==release and you’re all set. If you have 32b floats though, then intermediate 32b results aren’t really 32b. And usually you do have 32b floats.
  3. Set the FP CSR to 32b precision. debug==release, but you’re far from “all set” because now your 64b results, intermediate or otherwise, are really 32b. Not only is this a stupid waste of bits, it is not unlikely to fail, too, because 32b isn’t sufficient even for some of the problems encountered by camp 3. And of course it’s not compatible with other platforms.
  4. Set the FP CSR to 64b precision during most of the program run, and temporarily set it to 32b in the parts of the program using 32b floats. I wouldn’t go down that path; option 4 isn’t really an option. I doubt that you use both 32b and 64b variables in a very thoughtful way and manage to have a clear boundary between them. If you depend on the ability of everyone to correctly maintain the CSR, then it sucks sucks sucks.

Side note: I sure as hell don’t believe in “very special” “testing” build/running modes. For example, you could say that you have a special mode where you use option (3) and get 32b results, and use that mode to test debug==release or something. I think it’s completely self-defeating, because the point of consistency is being able to reproduce a phenomenon X that happens in a mode which is actually important, in another mode where reproducing X is actually useful. Therefore, who needs consistency across inherently useless modes? We’d be defeating the purpose of defeating the purpose of IEEE floating point.

Therefore, if you don’t have SSE, the only option is (2) - set the FP CSR to 64b and try to avoid 32b floats. On Linux, you can do it with:

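#include <fpu_control.h>                    /* glibc header */
fpu_control_t cw;
_FPU_GETCW(cw);                             /* read the x87 control word */
cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;   /* clear the precision bits, select double */
_FPU_SETCW(cw);                             /* write it back */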

Do it first thing in main(). If you use C++, you should do it first thing before main(), because people can use FP in constructors of global variables. This can be achieved by figuring out the compiler-specific translation unit initialization order, compiling your own C/C++ start-up library, overriding the entry point of a compiled start-up library using stuff like LD_PRELOAD, overwriting it in a statically linked program right there in the binary image, having a coding convention forcing to call FloatingPointSingleton::instance() before using FP, or shooting the people who like to do things before main(). It’s a trade-off.
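One more candidate for the "before main()" list, sketched under loud assumptions: GCC and Clang accept __attribute__((constructor)), an extension that is not among the options listed above. Whether such a function runs before every C++ global constructor depends on the toolchain and link order, so treat that as something to verify rather than a guarantee. The HAVE_X87_CW macro and both function names are invented for this sketch.

```c
#include <assert.h>

/* Only use <fpu_control.h> where it exists and the target is x86. */
#if defined(__has_include)
#  if __has_include(<fpu_control.h>) && (defined(__i386__) || defined(__x86_64__))
#    include <fpu_control.h>
#    define HAVE_X87_CW 1
#  endif
#endif

/* Returns 1 if the x87 precision field selects double, 0 if it does
   not, and -1 if this platform has no x87 control word to inspect. */
int x87_precision_is_double(void) {
#ifdef HAVE_X87_CW
    fpu_control_t cw;
    _FPU_GETCW(cw);
    return (cw & _FPU_EXTENDED) == _FPU_DOUBLE;
#else
    return -1;
#endif
}

/* GCC/Clang extension: ask the toolchain to run this before main(). */
__attribute__((constructor))
void set_x87_double_precision(void) {
#ifdef HAVE_X87_CW
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);
    assert(x87_precision_is_double() == 1);
#endif
}
```

The control word is per-thread state, so how it propagates to threads created later is another thing to check on your platform before relying on this.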

The situation is really even worse because the FPU CSR setting only affects mantissa precision but not the exponent range, so you never work with “real” 64b or 32b floats there. This matters in cases of huge numbers (overflow) and tiny numbers (double rounding of subnormals). But it’s bad enough already, and camp 3 people don’t really care about the extra horror; if you want those Halloween stories, you can find them here. The good news is that today, you are quite likely to have SSE2 and very likely to have SSE on your machine. So you can automatically sanitize all the mess as follows:

  1. If you have SSE2, use it and live happily ever after. SSE2 supports both 32b and 64b operations and the intermediate results are of the size of the operands. BTW, mixed expressions like a+b where a is float and b is double don’t create consistency problems on any platform because the C standard specifies the rules for promotion precisely and portably (a will be promoted to double). The gcc way of using SSE2 for FP is -mfpmath=sse -msse2.
  2. If you only have SSE, use it for 32b floats which it does support (gcc: -mfpmath=sse -msse). 64b floats will go to the old FP unit, so you’ll have to configure it to 64b intermediate results. Everything will work, the only annoying things being (1) the retained necessity to shoot the people having fun before main and (2) the slight differences in the semantics of control flags in the old FP and the SSE FP CSR, so if you configure your own policy, floats and doubles will not behave exactly the same. Neither problem is a very big deal.

Interestingly, SSE with its support for SIMD FP commands actually can make things worse in the standard-violating-algebraic-optimizations department. Specifically, Intel’s compiler reportedly has (had?) an optimization which unrolls FP accumulation loops and reorders additions in order to utilize SIMD FP commands (gcc 4 does that, too, but only if you explicitly ask for trouble with -funsafe-math-optimizations or similar). But I wouldn’t conclude anything from it, except that automatic vectorization, which is known to work only on the simplest of code snippets, actually doesn’t work even on them.
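The effect of such a loop rewrite can be imitated in scalar code. The sketch below is not what any particular compiler emits; it only mimics a 2-lane vectorized accumulation (two doubles per SSE2 register) to show how the summation order changes. Both function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Plain left-to-right accumulation, as the source code spells it. */
double sum_sequential(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* What a 2-lane SIMD rewrite effectively computes: even-indexed and
   odd-indexed elements are accumulated separately, then combined. */
double sum_two_lanes(const double *x, size_t n) {
    assert(n % 2 == 0); /* keep the sketch free of a remainder loop */
    double lane0 = 0.0, lane1 = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        lane0 += x[i];
        lane1 += x[i + 1];
    }
    return lane0 + lane1;
}
```

For the input {1e20, 1.0, -1e20, 1.0}, the sequential loop returns 1 and the two-lane version returns 2. Neither is "wrong" from the camp 3 point of view; the trouble is that which one you get depends on the optimizer.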

Summary: use SSE2 or SSE, and if you can’t, configure the FP CSR to use 64b intermediates and avoid 32b floats. Even the latter solution works passably in practice, as long as everybody is aware of it.

I think I covered everything I know except for things like long double, FP exceptions, etc. - and if you need that, you’re not in camp 3; go away and hang out with your ivory tower numerical analyst friends. If you know a way to automate away more pain, I’ll be grateful for every FP inconsistency debugging afternoon your advice will save me.

Happy Halloween!

Optimal processor size

I’m going to argue that high-performance chip designs ought to use a (relatively) modest number of (relatively) strong cores. This might seem obvious. However, enough money is spent on developing the other kinds of chips to make the topic interesting, at least to me.

I must say that I understand everyone throwing millions of dollars at hardware which isn’t your classic multi-core design. I have an intimate relationship with multi-core chips, and we definitely hate each other. I think that multi-core chips are inherently the least productive programming environment available. Here’s why.

Our contestants are:

  • single-box, single-core
  • single-box, multi-core
  • multi-box

With just one core, you aren’t going to parallelize the execution of anything in your app just to gain performance, because you won’t gain any. You’ll only run things in parallel if there’s a functional need to do so. So you won’t have that much parallelism. Which is good, because you won’t have synchronization problems.

If you’re performance-hungry enough to need many boxes, you’re of course going to parallelize everything, but you’ll have to solve your synchronization problems explicitly and gracefully, because you don’t have shared memory. There’s no way to have a bug where an object happens to be accessed from two places in parallel without synchronization. You can only play with data that you’ve computed yourself, or that someone else decided to send you.

If you need performance, but for some reason can’t afford multiple boxes (you run on someone’s desktop or an embedded device), you’ll have to settle for multiple cores. Quite likely, you’re going to try to squeeze every cycle out of the dreaded device you have to live with just because you couldn’t afford more processing power. This means that you can’t afford message passing or a side-effect-free environment, and you’ll have to use shared memory.

I’m not sure about there being an inherent performance impact to message passing or to having no side effects. If I try to imagine a large system with massive data structures implemented without side effects, it looks like you have to create copies of objects at the logical level. Of course, these copies can then be optimized out by the implementation; I just think that some of the copies will in fact be implemented straightforwardly in practice.

I could be wrong, and would be truly happy if someone explained to me why. I mean, having no side effects helps analyze the data flow, but the language is still Turing-complete, so you don’t always know when an object is no longer needed, right? So sometimes you have to make a new object and keep the old copy around, just in case someone needs it, right? What’s wrong with this reasoning? Anyway, today I’ll assume that you’re forced to use mutable shared memory in multi-core systems for performance reasons, and leave this no-side-effects business for now.

Summary: multiple cores are for performance-hungry people without a budget for buying computational power. So they end up with awful synchronization problems due to shared memory mismanagement, which is even uglier than normal memory mismanagement, like leaks or dangling references.

Memory mismanagement kills productivity. Maybe you disagree; I won’t try to convince you now, because, as you might have noticed, I’m desperately trying to stay on topic here. And the topic was that multi-core is an awful environment, so it’s natural for people to try to develop a better alternative.

Since multi-core chips are built for anal-retentive performance weenies without a budget, the alternative should also be a high-performance, cheap system. Since the clock frequency doesn’t double as happily as it used to these days, the performance must come from parallelism of some sort. However, we want to remove the part where we have independent threads accessing shared memory. What we can do is two things:

  • Use one huge processor.
  • Use many tiny processors.

What does processor “size” have to do with anything? There are basically two ways of avoiding synchronization problems. The problems come from many processors accessing shared memory. The huge processor option doesn’t have many processors; the tiny processors option doesn’t have shared memory.

The huge processor would run one thread of instructions. To compensate for having just one processor, each instruction would process a huge amount of data, providing the parallelism. Basically we’d have a SIMD VLIW array, except it would be much much wider/deeper than stuff like AltiVec, SSE or C6000.

The tiny processors would talk to their neighbor tiny processors using tiny FIFOs or some other kind of message passing. We use FIFOs to eliminate shared memory. We make the processors tiny because large processors are worthless if they can’t process large amounts of data, and large amounts of data mean lots of bandwidth, and lots of bandwidth means memory, and we don’t want memory. The advantage over the SIMD VLIW monster is that you run many different threads, which gives more flexibility.

So it’s either huge or tiny processors. I’m not going to name any particular architecture, but there were and still are companies working on such things, both start-ups and seasoned vendors. What I’m claiming is that these options provide less performance per square millimeter compared to a multi-core chip. So they can’t beat multi-core in the anal-retentive performance-hungry market. Multiple cores and the related memory mismanagement problems are here to stay.

What I’m basically saying is, for every kind of workload, there exists an optimal processor size. (Which finally gets me to the point of this whole thing.) If you put too much stuff into one processor, you won’t really be able to use that stuff. If you don’t put enough into it, you don’t justify the overhead of creating a processor in the first place.

When I think about it, there seems to be no way to define a “processor” in a “universal” way; a processor could be anything, really. Being the die-hard von-Neumann machine devotee that I am, I define a processor as follows:

  • It reads, decodes and executes an instruction stream (a “thread”)
  • It reads and writes memory (internal and possibly external)

This definition ignores at least two interesting things: that the human brain doesn’t work that way, and that you can have hyper-threaded processors. I’ll ignore both for now, although I might come back to the second thing some time.

Now, you can play with the “size” of the processor - its instructions can process tiny or huge amounts of data; the local memory/cache size can also vary. However, having an instruction processing kilobytes of data only pays off if you can normally give the processor that much data to multiply. Otherwise, it’s just wasted hardware.

In a typical actually interesting app, there aren’t that many places where you need to multiply a zillion adjacent numbers at the same cycle. Sure, your app does need to multiply a zillion numbers per second. But you can rarely arrange the computations in a way meeting the time and location constraints imposed by having just one thread.

I’d guess that people who care about, say, running a website back-end efficiently know exactly what I mean; their data is all intertwined and messy, so SIMD never works for them. However, people working on number crunching generally tend to underestimate the problem. The abstract mental model of their program is usually much more regular and simple in terms of control flow and data access patterns than the actual code.

For example, when you’re doing white board run time estimations, you might think of lots of small pieces of data as one big one. It’s not at all the same; if you try to convince a huge SIMD array that N small pieces of data are in fact one big one, you’ll get the idea.
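A back-of-the-envelope version of that whiteboard mistake, as a sketch. The model is an assumption made for illustration: each small piece of data is processed on its own and padded up to a whole number of vector instructions. The function name is invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of SIMD lanes doing useful work when n_pieces separate runs
   of data, with the given lengths, are each processed by vector
   instructions `width` elements wide. Each run is padded up to a whole
   number of vector instructions; this padding model is an assumption. */
double simd_lane_utilization(const size_t *lengths, size_t n_pieces,
                             size_t width) {
    assert(width > 0);
    size_t useful = 0, issued = 0;
    for (size_t i = 0; i < n_pieces; i++) {
        size_t instructions = (lengths[i] + width - 1) / width;
        useful += lengths[i];
        issued += instructions * width;
    }
    return issued ? (double)useful / (double)issued : 0.0;
}
```

Under this model, one run of 16 elements keeps a 16-wide unit fully busy, while four runs of 4 elements use a quarter of the lanes, even though the whiteboard total is 16 elements in both cases.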

For many apps, and I won’t say “most” because I’ve never counted, but for many apps, data parallelism can only get you so much performance; you’ll need task parallelism to get the rest. “Task parallelism” is when you have many processors running many threads, doing unrelated things.

One instruction stream is not enough, unless your app is really (and not theoretically) very simple and regular. If you have one huge processor, most of it will remain idle most of the time, so you will have wasted space in your chip.

Having ruled out one huge processor, we’re left with the other extreme - lots of tiny ones. I think that this can be shown to be inefficient in a relatively intuitive way.

Each time you add a “processor” to your system, you basically add overhead. Reading and decoding instructions and storing intermediate results to local memory is basically overhead. What you really want is to compute, and a processor necessarily includes quite some logic for dispatching computations.

What this means is that if you add a processor, you want it to have enough “meat” for it to be worth the trouble. Lots of tiny processors is like lots of managers, each managing too few employees. I think this argument is fairly intuitive, at least I find it easy to come up with a dumb real-world metaphor for it. The huge processor suffering from “lack of regularity” problems is a bit harder to treat this way.

Since a processor has an “optimal size”, it also has an optimal level of performance. If you want more performance, the optimal way to get it is to add more processors of the same size. And there you have it - your standard, boring multi-core design.

Now, I bet this would be more interesting if I could give figures for this optimal size. I could of course shamelessly weasel out of this one - the optimal size depends on lots of things, for example:

  • Application domain. x86 runs Perl; C6000 runs FFT. So x86 has speculative execution, and C6000 has VLIW. (It turns out that I use the name “Perl” to refer to all code dealing with messy data structures, although Python, Firefox and Excel probably aren’t any different. The reason must be that I think of “irregular” code in general and Perl in particular as a complicated phenomenon, and a necessary evil).
  • The cost of extra performance. Will your customer pay an extra 80% for an extra 20% of performance? For an x86-based system, the answer is more likely to be “yes” than for a C6000-based system. If the answer is “yes”, adding hardware for optimizing rare use cases is a good idea.

I could go on and on. However, just for the fun of it, I decided to share my current magic constants with you. In fact there aren’t many of them - I think that you can use at most 8-16 of everything. That is:

  • At most 8-16 vector elements for SIMD operations
  • At most 8-16 units working in parallel for VLIW machines
  • At most 8-16 processors per external memory module

Also, 8-16 of any of these is a lot. Many good systems will use, say, 2-4 because their application domain is more like Perl than FFT in some respect, so there’s no chance of actually utilizing that many resources in parallel.

I have no evidence that the physical constants of the universe make my magic constants universally true. If you know about a great chip that breaks these “rules”, it would be interesting to hear about it.

The Algorithmic Virtual Machine

There’s a very influential platform called the AVM, which stands for Algorithmic Virtual Machine. That’s the imaginary device people use as their mental model of a computer. In particular, it’s used by many people working on algorithms where performance matters. Performance matters in many different contexts, ranging from huge clusters processing astronomic amounts of data to modest applications running on pathetically weak hardware. However, I believe that the core architecture of the AVM is basically the same everywhere.

AVM application development is done using the ubiquitous AVM SDK - a whiteboard and a couple of hands for handwaving. An AVM application consists of a set of operations your algorithm needs executed. Each operation has a cost (typically one cycle, sometimes more). You can then estimate the run time of your algorithm by the clever technique of summing the cost of all operations.
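To make the "clever technique" concrete, here's a minimal sketch of what an AVM estimate boils down to. The cost table and the function name are made up for illustration; they don't describe any real machine.

```python
# Hypothetical per-operation costs in cycles; the numbers are illustrative only.
AVM_COST = {"mul": 1, "add": 1, "div": 4}

def avm_estimate(op_counts, cost=AVM_COST):
    """Whiteboard-style estimate: sum of (count * cost) over operation kinds.

    Only the operations themselves are counted; nothing else is modeled.
    """
    return sum(count * cost[op] for op, count in op_counts.items())

# Dot product of two N-element vectors: N multiplications and N additions.
N = 1000
dot_product_cycles = avm_estimate({"mul": N, "add": N})
```

The whole model fits in one line of arithmetic, which is exactly why it's so popular at the whiteboard.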

These estimations are never close enough to the real run time. The definition of “close enough” varies; the quality of estimations, by and large, doesn’t. That is, I claim that your handwavy AVM-derived estimation will fail to meet your precision requirements no matter what those requirements are. Apparently our tolerance for errors grows with the lack of understanding of the problem, but it never grows enough. But I’m not really sure about this theory; I’m only sure about the AVM-estimations-suck part. Here’s why.

The AVM is basically this imaginary machine that runs “operations”. Here are some things that real machines must do, but the AVM doesn’t:

  • Fetching instructions
  • Fetching operands
  • Testing for conditions
  • Storing results

Basically, the Algorithmic Virtual Machine developers concentrate on “operations” and ignore addressing, branches, caches, buses, registers, pipelines, and all those other gadgets which are needed in order to dispatch the operation. In fact, that’s how I currently distinguish between people who write software to get a job done and people who think of software as their job. “People who program” are into operations (algebra, networking, AI); “programmers” are into dispatching (programming languages, operating systems, OO). This is about mental focus rather than aptitude. I haven’t noticed that people of either group are inherently less productive than the other kind.

When they’re after performance, the “operations” people will naturally look for a way to reduce the number of operations. Sometimes, they’ll find an algorithm with a better asymptotic complexity - O(N+M) instead of O(N*M). At other times, they’ll come up with a way to perform 4*N*M operations instead of 16*N*M. Both results are very significant - if M and N are the only variables. The trouble is that you can’t see all the variables if you just look at the math (as in “we want to multiply and sum all these and then compare to that”). That way, you assume that you run on the AVM and leave out all the dispatching-related variables and get the wrong answer.

Is there a way to take the cost of dispatching into account? Not really, not without implementing your algorithm and measuring its performance. However, families of machines do have related sets of heuristics that can be used to guess the cost of running on them. For example, here are a couple of heuristics that I use for SIMD machines (they are relevant elsewhere, but their relative importance may drop):

  1. Bandwidth is costly.
  2. Addressing is costly.

These heuristics are vague, and I don’t see a very good way to make them formal. Perhaps there isn’t any. To show that my points have any formal significance, I’d have to formally prove that there’s unavoidable intrinsic cost to some things no matter how you build your hardware. And I don’t know how to go about that. So what I’ll do is I’ll give some examples to show what I mean, and leave it there.

Bandwidth

Consider two “algorithms” (probably too fancy a name in this context): computing dot product, and computing its partial sums (Matlab: sum(a .* b) and cumsum(a .* b)). Exactly the same amount of “operations” - N multiplications and N additions. Many people with BA, MSc and PhD degrees in CS assume that the run time is going to be the same, too. It won’t, because sum only produces one output, and cumsum produces N outputs. Worse, if the input vector elements are 8-bit integers, we probably need at least 32 bits for each output element. So we generate N*4 bytes of output from N*2 input bytes.
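Here's a rough Python rendering of the two Matlab one-liners, plus the byte counting from the paragraph above. The helper io_bytes and its parameter names are mine, and the 1-byte inputs and 4-byte outputs are the assumptions stated in the text, not properties of any particular machine.

```python
from itertools import accumulate

def dot(a, b):
    """sum(a .* b): N multiplies, N adds, one output value."""
    return sum(x * y for x, y in zip(a, b))

def partial_sums(a, b):
    """cumsum(a .* b): the same N multiplies and N adds, but N output values."""
    return list(accumulate(x * y for x, y in zip(a, b)))

def io_bytes(n, n_outputs, in_bytes=1, out_bytes=4):
    """(input bytes, output bytes) for two n-element input vectors."""
    return 2 * n * in_bytes, n_outputs * out_bytes

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
n = len(a)
dot_io = io_bytes(n, n_outputs=1)     # 2*N bytes in, 4 bytes out
cumsum_io = io_bytes(n, n_outputs=n)  # 2*N bytes in, 4*N bytes out
```

Same operation count, very different output traffic: the second function has to put N values somewhere.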

At this point, some people will say “Yeah, memory. Processors are fast, memories are slow, sure, memory is a problem”. But it isn’t just about the memory; memory bandwidth is just one kind of bandwidth. Let’s look at the non-memory problems of the partial-sums-of-dot-product algorithm. On the way, I’ll try to show how the “bandwidth costs” heuristic can be used to guess what your hardware can do and what the performance will be.

Consider a machine with a SIMD instruction set. Most likely, the machine has registers of fixed width (say, 16 bytes), and each instruction gets 2 inputs and produces 1 output. Why? Well, the hardware ought to support 2 inputs and 1 output to do basic math. Now, if it also wants to have an instruction that produces, say, 4 outputs, then it needs to have 3 additional output buses from the data processing units to the register file. It also needs a multiplexer so that each of the 4 outputs can be routed to each of its N registers (N can be 16 or 32 or even 128). The cost of multiplexers is, roughly, O(M*N), where M is the number of inputs and N is the number of outputs. That’s awfully costly. Bandwidth costs. So they probably use 2 inputs and 1 output everywhere.

Now, suppose the machine has 16 multipliers, which is quite likely - 1 multiplier for each register byte, so we can multiply 16 pairs of bytes simultaneously. Does this mean that we can then take those 16 products and compute 16 new partial sums, all in the same cycle? Nope, because, among other things, we’d need a command producing 16×4 bytes to do that, and that’s too much bandwidth. Are we likely to have a command that updates fewer than 16 accumulators? Yes, because that would speed up dot products, and dot products are very important; let’s look at the manual.

You’re likely to find a command updating - guess how many? - 4 accumulators (32 bits times 4 equals 16 bytes, that’s exactly one machine register). If the register size is 8 bytes, you’ll probably get a command updating 2 accumulators, and so on. Sometimes the machine uses “register pairs” for output; that doubles the register size for output bandwidth calculation purposes. The bottom line is that instruction set extensions can speed up dot product to an extent impossible for its partial sums. You might have noticed another problem here, that of the dependency of a partial sum on the previous partial sum. Removing this dependency doesn’t solve the bandwidth problem. For example, consider the vertical projection of point-wise multiplication of 2 8-bit images, which has the same not-enough-accumulators problem.
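The guessing game above is just division. A small sketch, with function names of my own invention, assuming that one instruction can write exactly one register (or one register pair) worth of accumulators:

```python
import math

def accumulators_per_instruction(register_bytes, accumulator_bits=32, register_pair=False):
    """How many accumulators fit into the output of a single instruction."""
    output_bytes = register_bytes * (2 if register_pair else 1)
    return output_bytes // (accumulator_bits // 8)

def instructions_to_write(n_outputs, register_bytes, **kwargs):
    """Lower bound on instructions needed just to produce n_outputs accumulators."""
    per_instruction = accumulators_per_instruction(register_bytes, **kwargs)
    return math.ceil(n_outputs / per_instruction)

# 16 products per cycle from the multipliers, 16-byte registers:
dot_product_writes = instructions_to_write(1, 16)    # one running sum is enough
partial_sum_writes = instructions_to_write(16, 16)   # 16 distinct outputs
```

With these assumptions, 16 partial sums cost at least 4 output-producing instructions, however many multipliers you have.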

There is little you can do about the bandwidth problem in the partial sums case - the algorithm is I/O bound. Some algorithms aren’t, so you can optimize them to minimize the cost of bandwidth. For example, matrix multiplication is essentially lots of dot products. If you do those dot products straightforwardly, you’ll have a loop spending 2 commands for loading the matrix elements into registers, and one command for multiplying and accumulating (MAC). 2 loads per MAC means an overhead of 200%.

However, you can work on blocks - 4 rows of matrix A and 4 columns of matrix B, and compute the 4×4=16 dot products in your loop. That’s 4+4=8 loads per 16 MACs; the overhead dropped to 50%. If you have enough registers to do this. And it’s still quite impressive overhead, isn’t it? Your typical AVM user would be very disappointed. (Yes, some machines can parallelize the loads and the MACs, but some can’t, and it’s a toy example, and stop nitpicking). BTW, blocking can be used to save loads from main memory to cache just like we’ve used it to save loads from cache to registers.
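The load counting can be checked with a toy model. This is a sketch, not a tuned kernel: plain Python lists stand in for memory, dictionaries stand in for registers, and the "loads" and "macs" counters are my own bookkeeping for the cache-to-register traffic discussed above.

```python
def blocked_matmul(A, B, block=4):
    """Multiply A (n x k) by B (k x m), one block x block tile of outputs at a time.

    Returns (C, loads, macs). A "load" is counted each time an element of A or B
    is brought into a simulated register; a "mac" is one multiply-accumulate.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    loads = macs = 0
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            rows = range(i0, min(i0 + block, n))
            cols = range(j0, min(j0 + block, m))
            for p in range(k):
                a_regs = {i: A[i][p] for i in rows}  # one load per tile row
                b_regs = {j: B[p][j] for j in cols}  # one load per tile column
                loads += len(a_regs) + len(b_regs)
                for i in rows:
                    for j in cols:
                        C[i][j] += a_regs[i] * b_regs[j]
                        macs += 1
    return C, loads, macs

A = [[i + j for j in range(8)] for i in range(8)]
B = [[i * j + 1 for j in range(8)] for i in range(8)]
_, loads_1x1, macs_1x1 = blocked_matmul(A, B, block=1)
_, loads_4x4, macs_4x4 = blocked_matmul(A, B, block=4)
overhead_1x1 = loads_1x1 / macs_1x1  # 2 loads per MAC
overhead_4x4 = loads_4x4 / macs_4x4  # 8 loads per 16 MACs
```

The number of MACs is the same either way; only the number of loads changes with the block size.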

OK. With partial sums of dot product, the bandwidth problem kills performance, and with matrix multiplication, it doesn’t. What about convolution, which is about as basic as our previous examples? Gee, I really don’t know. It’s tricky, because with convolution, you need to store intermediate results somewhere, and it’s unclear how many of them you’re going to need. The optimal implementation depends on the quirks of the data processing units, the I/O, and the filter size. If you come across a benchmark showing the performance of convolution on some machine, you’ll probably find interesting variations caused by the filter size.

So we have a bread-and-butter algorithm, and non-trivial & non-portable performance characteristics. I think it’s one indication that your own less straightforward algorithm will also perform somewhat unpredictably. Unless you know an exact reason for the opposite.

Addressing

Bandwidth is one problem with fetching operands and storing results. Another problem is figuring out where they go. In the case of registers, we have costly multiplexers for selecting the source and destination registers of instructions. In the case of memory, we have addresses. Computing addresses has a cost. Reading data from those addresses also has a cost. Some address sequences are costlier than others from one of these perspectives, or both.

The dumbest example is the misalignment problem. People who learned C on x86 are sometimes annoyed when they meet a PowerPC or an ARM or almost any other processor since it won’t read a 32-bit integer from a misaligned address. So when you read a binary buffer from a file or a socket, you can’t just cast the char* to an int* and expect it to work. Isn’t it nice of x86 to properly handle these cases?

Maybe it’s nice, maybe it isn’t (at least if it failed, the code would be fixed to become legal C), but it sure is costly. The fact that it’s “in the hardware” doesn’t make it a single-cycle operation. If your address is misaligned, the 32 bits may reside in two different memory words (no matter what the word size is). The hardware will have to read the low word, and then read the high word, and then take the high bits of the low word and the low bits of the high word and make a single 32-bit value out of them. Because in one cycle, memories can only fetch one word from an aligned address.
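Here's the two-reads-and-some-shifting dance spelled out. It's a toy model under assumptions of my own: a 4-byte memory word, little-endian byte order, and a "memory" that can only hand out aligned words. Real hardware does this in logic, not in Python, but the steps are the same.

```python
WORD_BYTES = 4  # assumed memory word size; the argument holds for any size

def read_aligned_word(memory, word_index):
    """The only thing this toy memory can do: fetch one aligned little-endian word."""
    start = word_index * WORD_BYTES
    return int.from_bytes(memory[start:start + WORD_BYTES], "little")

def read_u32(memory, address):
    """Read a 32-bit little-endian value from any byte address.

    Returns (value, number_of_aligned_fetches).
    """
    word_index, offset = divmod(address, WORD_BYTES)
    low = read_aligned_word(memory, word_index)
    if offset == 0:
        return low, 1
    high = read_aligned_word(memory, word_index + 1)
    shift = 8 * offset
    # High bits of the low word become the low bits of the result;
    # low bits of the high word become its high bits.
    value = (low >> shift) | ((high << (32 - shift)) & 0xFFFFFFFF)
    return value, 2

memory = bytes(range(16))
aligned_value, aligned_fetches = read_u32(memory, 4)
misaligned_value, misaligned_fetches = read_u32(memory, 5)
```

The misaligned read gives the right answer, at twice the number of memory fetches plus the shifting and merging.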

Does it matter outside of I/O-related code using illegal pointer-casting? Consider the prosaic algorithm of computing the first derivative of a vector, spelled v(2:end)-v(1:end-1) in Matlab. If we run on a SIMD machine, we could execute several subtractions simultaneously. In order to do that, we need to fetch a word containing v[0]…v[15] and a word containing v[1]…v[16] (both zero-based). But the second word is misaligned. The handling of misalignment will have a cost, whether it’s done in hardware or in software.
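A sketch of the derivative computed a "vector" at a time shows where the misaligned loads come from. The 16-lane width matches the example above; the counters and the function name are mine, and list slices stand in for vector loads.

```python
LANES = 16  # assumed SIMD width in elements

def first_derivative_simd(v):
    """v[1:] - v[:-1], computed LANES elements at a time.

    Returns (result, aligned_loads, misaligned_loads).
    """
    out = []
    aligned = misaligned = 0
    for base in range(0, len(v) - 1, LANES):
        lo = v[base:base + LANES]          # starts at a multiple of LANES: aligned
        hi = v[base + 1:base + 1 + LANES]  # starts one element later: misaligned
        aligned += 1
        misaligned += 1
        # zip stops at the shorter operand, which handles the tail of the vector.
        out.extend(h - l for h, l in zip(hi, lo))
    return out, aligned, misaligned

v = [x * x for x in range(33)]
derivative, aligned_loads, misaligned_loads = first_derivative_simd(v)
```

Half of the vector loads in this loop are misaligned, and that's for about the most regular access pattern you could ask for.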

Well, at least the operands of subtraction live in subsequent addresses - 0,1,2…15 and 1,2,3…16. That’s how data processing units like them: you read a pack of numbers from memory and feed them right into the array of adders, ready to crunch them. It’s not always like that. Consider scaling: a(x) = b(s*x+t). This can be used to resize images (handy), or to play records at a different speed the way you’d do with a tape recorder (less handy, unless you like squeaky or growly voices).

Now, if s isn’t integral (say, s=0.6), you’d have to fetch data from places such as s*x+t = 1.3, 1.9, 2.5, 3.1, 3.7... Suppose you want to use linear interpolation to approximate b(1.3) as b(1)*0.7+b(2)*0.3. So now we need to multiply the vector of “low” elements - b([1,1,2...]) - by the vector of weights - [0.7,0.1,0.5...] - and add the result to the similar product b([2,2,3...])*[0.3,0.9,0.5...]. The multiplications and the additions map nicely to SIMD instruction sets; the indexing doesn’t, because you have those weird jumpy indexes. So this time, the addressing can become a real bottleneck because it can prevent you from using SIMD instructions altogether and serialize your entire computation.
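To see the jumpy indexes for yourself, here's a sketch that separates the addressing part (which indexes, which weights) from the arithmetic part. The function names are made up, the values s=0.6 and t=1.3 are chosen to reproduce the positions above, and b is the input vector from a(x) = b(s*x+t).

```python
import math

def scale_gather_plan(n_out, s, t):
    """For a(x) = b(s*x + t), x = 0..n_out-1: (low_indexes, low_weights, high_weights)."""
    low_idx, w_low, w_high = [], [], []
    for x in range(n_out):
        pos = s * x + t
        i = math.floor(pos)
        frac = pos - i
        low_idx.append(i)
        w_low.append(1.0 - frac)  # weight of b[i]
        w_high.append(frac)       # weight of b[i + 1]
    return low_idx, w_low, w_high

def scale(b, n_out, s, t):
    """Linear interpolation; the multiply-adds are regular, the indexing isn't."""
    low_idx, w_low, w_high = scale_gather_plan(n_out, s, t)
    return [b[i] * wl + b[i + 1] * wh for i, wl, wh in zip(low_idx, w_low, w_high)]

low_idx, w_low, w_high = scale_gather_plan(5, 0.6, 1.3)
b = [10 * i for i in range(6)]
a = scale(b, 5, 0.6, 1.3)
```

The weight vectors are perfectly ordinary SIMD operands; it's the b[i] and b[i + 1] gathers, with their repeated and skipped indexes, that a plain vector load can't express.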

Well, at least we access adjacent elements. This means that most memory accesses will hit the cache. When you bump into an element that isn’t cached yet, the machine will bring a whole cache line (say, 32 bytes), and then you’ll read the other elements in that cache line, so it will pay off. You can even issue cache prefetching instructions so that while you’re working on the current cache line, the machine will read the next one in the background. That way, you’ll hit the cache all the time, instead of having your processor repeatedly surprised (hey, I don’t have a(32) in the cache!.. hey, I don’t have a(64) in the cache!.. hey, I don’t have…). Avoiding the regularly scheduled surprise can be really beneficial, although cache prefetching is truly disgusting (it’s basically a very finicky kind of cooperative multi-tasking - you ought to stuff the prefetching commands into the exactly right spots in your code).

Now, consider a(x) = b(f(x)) - a generic transformation of an input vector given a function for computing the input coordinate from the output coordinate. We have no idea what the next address is going to be, do we? If the transformation is complicated enough, we’re going to miss the cache a lot. By the way, if the transformation is in fact simple, and the compiler knows the transformation at compile time, the compiler is still very unlikely to generate optimal cache prefetching commands. Which is one of the gazillion differences between C++ templates and “machine-optimal” code.
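A toy cache model makes the difference visible. Everything here is an assumption made for illustration: 32 elements per cache line, a tiny fully-associative cache with LRU replacement, and a seeded random index sequence standing in for a "complicated enough" f. Real caches are organized differently, so treat the counts as qualitative.

```python
import random

LINE_ELEMS = 32  # assumed elements per cache line

def count_misses(indexes, n_lines=8):
    """Count misses in a toy fully-associative LRU cache holding n_lines lines."""
    cache = []  # least recently used line first, most recently used last
    misses = 0
    for i in indexes:
        line = i // LINE_ELEMS
        if line in cache:
            cache.remove(line)
        else:
            misses += 1
            if len(cache) == n_lines:
                cache.pop(0)  # evict the least recently used line
        cache.append(line)
    return misses

n = 4096
sequential_misses = count_misses(range(n))  # b(x): one miss per cache line

rng = random.Random(0)  # fixed seed so the run is repeatable
scattered_indexes = [rng.randrange(n) for _ in range(n)]
scattered_misses = count_misses(scattered_indexes)  # b(f(x)), unpredictable f
```

Sequential access misses once per line and then enjoys 31 hits; the scattered sequence touches the same number of elements and misses on most of them.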

DVMs and TVMs

My bandwidth and addressing heuristics don’t model a real machine; they only model an upgrade to the AVM for SIMD machines. Multi-box computing is one example of an entire universe of considerations they fail to model. So what we got is a DVM - Domain-specific Virtual Machine.

Now, in order to estimate performance without measuring (which is necessary when you choose your optimizations - you just can’t try all the different options), I recommend a TVM (Target-specific Virtual Machine). You get one as follows. You start with the AVM. This gives overly optimistic performance estimations. You then add the features needed to get a DVM. This gives overly pessimistic estimations.

Then, you ask some low-level-loving person: “What are the coolest features of this machine that other machines don’t have?” This will give you the capabilities that the real processor has but its DVM doesn’t have. For example, PowerPC with AltiVec extensions is basically a standard SIMD DVM plus vec_perm. I won’t talk about vec_perm very much, but if you ever need to optimize for AltiVec, this is the one instruction you want to remember. It solves the indexing problem in the scaling example above, among other things. Using a SIMD DVM and forgetting about vec_perm would make AltiVec look worse than it really is, and some algorithms much more costly than they really are.

And this is how you get a TVM for your platform. The resulting mental model gives you a fairly realistic picture, second only to reading the entire manual and understanding the interactions of all the features (not that easy). And it definitely beats the AVM by… how do you estimate the quality of handwaving? OK, it beats the AVM by a factor of 5, on average. What, you want a proof? Just watch the hands go.

“High-level CPU”: follow-up

This is a follow-up on the previous entry, the “high-level CPU” challenge. I’ll try to summarize the replies and my opinion on the various proposals. But first, a summary of my original points:

  1. “Very” high-level languages have a cost. Attributing this cost to the underlying hardware architecture is wrong. You could move the cost from software to hardware, but that wouldn’t eliminate it. I primarily referred to languages characterized by indirection levels and late binding of user-defined operations, such as Lisp and Python, and to a lesser extent/confidence to side-effect-free languages like Haskell. I didn’t mean to say that high-level languages should not be used, in fact I think that their cost is wildly overestimated by many. However, denying the existence of any intrinsic cost guarantees that people will keep overestimating it, because if it weren’t that high a cost, why would you lie to them? I mean it very seriously; horrible tech marketing is responsible for the death (or coma) of many great things.
  2. Of all systems with similar cost and features, the one that has the least stuff implemented in hardware is the best, because you can change more things. The idea that moving things to hardware is a sure way to make them efficient is a misconception. Hardware can’t do “anything in one cycle”; there are many constraints involved. Therefore, it’s better to let the software explicitly control a set of low-level components than build hardware logic implementing high-level interfaces to them. For example, to add 2 numbers on a RISC machine, you load them to registers, then add. You could have a command adding operands from memory; it wouldn’t run faster, because the hardware would have to spend cycles on loading operands to (implicit) registers. Hardware doesn’t have to be a RISC machine, but it’s always better to move as much control to software as possible under the given system cost constraints.

I basically asked people to refute point 1 (“HLLs are costly”). What follows describes the attempts people made at it.

Computers you can’t program

Several readers managed to ignore my references to specific high-level languages and used the opportunity to pimp hardware architectures that can’t run those languages. Or any other programming languages designed for human beings, for that matter. Example architectures:

It is my opinion that the fans of this family of hardware/vaporware, consistent advocates of The New Age of Computing, have serious AI problems. Here’s a sample quote on cellular automata: “I guess they really are like us.” Well, if you want to build a computing device in order to have a relationship with it, maybe a cellular automaton will do the trick. Although I’d recommend first checking the fine selection of Homo Sapiens we have here on Planet Earth. Because those come with lots of features you’d like in a friend, a foe, a spouse or an employee already built-in, while computer hardware has a certain gap to fill in this department.

Me, I want to build machines to do stuff that someone “like us” wouldn’t want to do, for any of several reasons (the job is hard/boring/stinky/whatever). And once I’ve built them, I want people to be able to use them. Please note this last point. People and other “nature’s computers”, like animals and fungi, aren’t supposed to be “used”. In fact, all those systems spend a huge amount of resources to avoid being used. Machines aren’t supposed to be like that. Machines are supposed to do what you want. Which means that both the designer and the user need to control them. Now, a computer that can’t even be tricked into parsing HTML in a straightforward way doesn’t look like it’s built to be controlled, does it?

Let me supply you with an example: Prolog. Prolog is an order of magnitude more tame than a neural net (and two orders of magnitude compared to a cellular automaton) when it comes to “control” - you can implement HTML parsing with it. But Prolog does show alarming signs of independence - it spends most of its time in its inference engine, an elaborate mechanism running lengthy non-trivial loops, which sometimes turn out to be infinite. You aren’t supposed to single-step those loops; you’re supposed to specify truths about your world, and Prolog will derive more truths for you. Prolog was supposed to be the wave of the future about 25 years ago. I think it can be safely called dead by now, despite the fair amount of money poured into it. I think it died because it’s extremely frustrating to use - you just can’t tell why the hell it worked that way in each particular case. I’ve never seen anything remotely as annoying as Prolog, with the notable exception of Makefiles, running on top of a wonderful inference engine of their own.

My current opinion is that neural networks rarely deserve a special hardware implementation - if you need them, build a more traditional computer and run them on top of that; and cellular automata are just stillborn. I might be wrong in the sense that a hardware implementation of these models is the optimal solution for some problem, hence we’ll see those beasts in some corner of a successful real-world system. But the vast majority of computing, including AI apps, will run on machines that support basic bread-and-butter programmer things simply and straightforwardly. Here’s a Computing Technology Acceptance Lower Bound for ya: if you can’t parse a frigging log file with it, you can’t do anything with it.

Self-assembly computers

Our next contestant is a machine that you surely can program, once you’ve built it from the pieces which came in the box. Some people mentioned “FPGA”, others failed to call it by its name (one comment mentioned a “giant hypercube of gates”, for example). In this part, I’m talking about the suggestions to use an FPGA without further advice on exactly how it should be used; that is, FPGA as the architecture, not FPGA used to prototype an architecture.

Maybe people think that with an FPGA, “everything is possible”, so in particular, you could easily build a processor efficiently implementing a HLL. Well, FPGA is just a way to implement hardware allowing you to trade NRE for unit cost. And with hardware, some things are possible and some aren’t, or so I claim - for example, you can’t magically make the cost of HLLs go away. If you can’t think of a way to reduce the overhead HLLs impose on the system cost, citing FPGA doesn’t make your argument look any better. On the contrary - you’ve saved NRE, but you’ve raised the cost of the hardware by a factor of 5.

Another angle: can you build a compiler? Probably so. Would you like to start your project with building a compiler? Probably not. Now, what makes people think that they want to build hardware themselves? I really don’t know. Building hardware is gnarly, FPGA or not - there are lots of constraints you have to think about to make the thing efficient, and it’s extremely easy to err on the side of not having enough flexibility. The latter typically happens because you try to implement overly high-level interfaces; it then turns out that you need the same low-level components to do something slightly different.

And changing hardware isn’t quite as easy as changing software, even with FPGA, because hardware description code, with its massive parallelism and underlying synthesis constraints, is fairly tricky. FPGA is a perfectly legitimate platform for hardware vendors, but an awful interface for application programmers. If you deliver FPGAs, make it your implementation detail; giving it to application programmers isn’t very likely to make them happy in the long run.

At the other end of the spectrum, there’s the kind of “self-assembly computer” that reassembles itself automatically, “adapting to the user’s needs”. Even if it made any sense (and it doesn’t), it still wouldn’t answer the question: how should this magical hardware adapt to handle HLLs, for example, indirect memory access?

Actual computers designed to run HLLs

Some people mentioned actual hardware which was built to run HLLs, including Reduceron, Tcl on Board, Lisp Machines, Rekursiv, and ARM’s Jazelle instruction set. For some reason, nobody mentioned Intel’s 432, an object-oriented microprocessor which was supposed to replace x86, but was, among other things, too slow. This illustrates that the existence of a “high-level processor” doesn’t mean that it was a good idea (of course it doesn’t mean the opposite, either).

I’ll now talk about these machines in increasing order of my confidence that the architecture doesn’t remove the overhead posed by the HLL it’s supposed to run.

  • Reduceron is designed to run Haskell, and focuses on an optimization problem I wasn’t even aware of, that of graph reduction. One of the primary ideas seems to be that graph reduction doesn’t suffer from dependency problems which could inhibit parallelization, but still can’t be parallelized on stock CPUs. That’s because a lot of memory access is involved, and there’s typically little load/store bandwidth available to a CPU compared to its data processing capability. Well, I agree with this completely in the sense that memory access is the number one area where custom hardware design can help; more on that later. However, I’m not sure that the right way to go about it is to build a “Haskell Machine”; building a lower-level processor with lots of bandwidth available to it could be better. Then again, it could be worse, and my confidence level in this area is extremely low, which is why I list the Reduceron before the others: I think I’ll look into this whole business some more. Pure functional languages are a weak spot of mine; for now, I can only say three things for sure: (1) side effects are a huge source of bugs, (2) although they get in the way of optimizers, side effects are a poor man’s number one source of optimizations, so living without them isn’t easy, and (3) the Reduceron is a pretty cool project.
  • Tcl on Board was built to run a Tcl dialect. Tcl doesn’t pose optimization problems that languages like Lisp or Python do - it’s largely a procedural language grinding flat objects. And there’s another thing I ought to tell you: I don’t like Tcl. However, I think that this Tcl chip is kind of insightful, because it’s designed for low-end applications. And the single biggest win of having a “high-level” instruction set is to save space on program encoding. Several people mentioned it as a big deal; I don’t think of it as a big deal, because instruction caches always worked great for me (~90% hits without any particular optimizations). However, for really small systems of the low-end embedded kind, program encoding is a real issue. I’m not saying that Tcl on Board is a good (or a bad) idea by itself; I know nothing about these things. I’m just saying that while I think high-level hardware will fail to deliver speed gains, it might give you space gains, so it may be the way to go for really small systems which aren’t supposed to scale. Not that I know much about those systems, except that if I had to build one, I’d seriously consider Forth…
  • Lisp Machines ran Lisp, and Rekursiv ran LINGO, which apparently was somewhat similar to Smalltalk. This I know. What I don’t know is how the hardware support for the high-level features would eliminate the cost overhead of the HLLs involved; that’s because I don’t know the architecture, and nobody gave much detail. I don’t see a way to solve the fundamental problems. I mean, if I want to support arrays of bytes, then each byte must be tagged, mustn’t it? And if I only support fixnums larger than bytes, then I’d waste space, right? And just what could the LispM do about the hairy binding done by CLOS behind the scenes? Again, this doesn’t mean these machines weren’t a good idea; in fact I wish my desktop hardware were more expensive and more secure, and tagged architectures could help. All I’m saying is that it would be more expensive. I think. I’d like to hear more about LispM, simply because most people who used it seem to be very fond of it - I know just one exception.
  • Jazelle is supposed to run Java. Java is significantly lower-level than Lisp or Smalltalk. It still is a beautiful example, because the hardware support in this case yields little performance benefit. In fact MIPS reported that a software implementation of JVM running on a MIPS core outperformed a JVM using Jazelle by a factor of about 2. I’ve never seen a refutation of that.

Stock computers with bells and whistles

Finally, there was a bunch of suggestions to add specific features to traditional processors.

  • Content-addressable memory is supposed to speed up associative array look-ups. There’s a well-known aphorism by Alan Perlis - “A language that doesn’t affect the way you think about programming is not worth knowing”. Here’s my attempt at an aphorism: “A processor that doesn’t affect the way you access memory is not worth building”. This makes the wide variety of tools designed to help you build a SIMD VLIW machine with your own data processing instructions uninteresting to me, and on the other hand, makes CAM quite appealing. I came to believe that your biggest problem isn’t processing the data, it’s fetching the data. I might talk about it some time; the Reduceron, essentially designed to solve a memory access problem preventing the optimization of a “perfectly parallelizable” algorithm, is one example of this. However, CAM goes way beyond providing more bandwidth or helping with the addressing - it adds comparison logic to each memory word. While it sounds impractical to replace all of your RAM with CAM, stashing a CAM array somewhere inside your system could help with some problems. Then again, it won’t necessarily pay off - it depends on the exact details of what you’re doing. All I can say at this point is that it’s a Worthy Idea, which, for some reason, I keep forgetting about, and I shouldn’t.
  • GC/reference counting optimizations. Maybe I’m wildly wrong, but I don’t think the garbage is a big deal, ’cause how much time do you spend on garbage collection compared to plain malloc/free? The way I see it, the problem isn’t so much with the overhead of garbage collection as it is with the number of small objects allocated by the system and, most importantly, the number of indirect memory accesses. I learned that some Lisp compilers can do object inlining with varying amounts of user intervention; well, when it works out, it removes the need for special hardware support. The thing is, I think the main battle here is to flatten objects, not to efficiently get rid of them. And I think that it’s quite clearly software that should fight that battle.
  • Regular expression and string functions in hardware: I don’t think it’s worth the trouble, because how much time do you spend in regex matching anyway? Maybe it’s because I don’t process massive volumes of text, but when I do process the moderate amounts of text I bump into, there’s the part where you store your findings in data structures, and I think it might be the bottleneck. And then a huge amount of data comes from places like RDBMSes where you don’t have to parse much. You’d end up with idle silicon, quietly leaking power.
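
To make the content-addressable memory item above a bit more concrete, here is a toy software model. It is only a sketch: the `ToyCAM` class and its method names are made up for illustration, and the loop in `match` stands in for the per-word comparison logic that real CAM hardware would apply to all words at once.

```python
class ToyCAM:
    """Toy model of a content-addressable memory (hypothetical, illustration only)."""

    def __init__(self, size):
        self.words = [None] * size  # None marks an empty word

    def write(self, address, value):
        self.words[address] = value

    def match(self, key):
        # In hardware, every word has its own comparator, so all of these
        # comparisons would happen at once. In software we can only loop.
        return [addr for addr, word in enumerate(self.words) if word == key]


cam = ToyCAM(8)
cam.write(2, "apple")
cam.write(5, "pear")
cam.write(6, "apple")
print(cam.match("apple"))  # addresses of the words holding the key
```

The interesting part is what the model can't show: the look-up is by value, not by address, and the cost of the search is paid in comparator silicon rather than in cycles.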

The good stuff

At the bottom line, there were two hardware-related things which captured my intoxicated imagination: the Reduceron and content-addressable memories. If anything ever materializes around this, I’ll send out some samples. In the meantime - thanks!

The “high-level CPU” challenge

Do you love (“very”) high-level languages? Like Lisp, Smalltalk, Python, Ruby? Or maybe Haskell, ML? I love high-level languages.

Do you think high-level languages would run fast if the stock hardware weren’t “brain-damaged”/“built to run C”/“a von Neumann machine (instead of some other wonderful thing)”? You do think so? I have a challenge for you. I bet you’ll be interested.

Background:

  • I work on the definition of custom instruction set processors (just finished one).
  • It’s fairly high-end stuff (MHz/transistor count in the hundreds of millions).
  • I also work on the related programming languages (compilers, etc.).
  • Whenever application programmers have to deal with low-level issues of the machine I’m (partly) responsible for, I feel genuine shame. They should be doing their job; the machine details are my job. Feels like failure (even if “the state of the art” isn’t any better).
  • …But, I’m also obsessed with performance. Because the apps which run on top of my stuff are ever-hungry, number-crunching real time monsters. Online computer vision. Loads of fun, and loads of processing that would make a “classic” DSP hacker’s eyeballs pop out of his skull.

My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don’t you actually tell us about it, instead of briefly mentioning it? Requirement: your architecture should allow HLL code to run much faster than it does with a compiler emitting something like RISC instructions, without significant physical size penalties. In other words, if I have so many square millimeters of silicon, and I pad it with your cores instead of, say, MIPS cores, I’ll be able to implement my apps in a much more high-level fashion without losing much performance (losing 25% sounds like a reasonable upper bound). Bonus points for intrinsic support for vectorized low-precision computations.

If your architecture meets these requirements, I’ll consider a physical implementation very seriously (because we could use that kind of thing), and if it works out, you’ll get a chip so you can show people what your ideas look like. I can’t promise anything, because, as usual, there are more forces at play than the theoretical technical virtue of an idea. I can only promise to publicly announce that your idea was awesome and I’d love to implement it; not much, but it’s the best I can deliver.

If you can’t think of anything, then your consistent assertions about “stupid hardware” are a stupid bluff. Do us a favor and shut up. WARNING: I can’t do hardware myself, but there are lots of brilliant hardware hackers around me, and I’ve seen how actual chips are made and what your constraints are. Don’t bullshit me, buddy.

Seriously, I’m sick and tired of HLL weenie trash talk. Especially when it comes from apparently credible and exceedingly competent people.

Alan Kay, the inventor of Smalltalk: “Just as an aside, to give you an interesting benchmark—on roughly the same system, roughly optimized the same way, a benchmark from 1979 at Xerox PARC runs only 50 times faster today. Moore’s law has given us somewhere between 40,000 and 60,000 times improvement in that time. So there’s approximately a factor of 1,000 in efficiency that has been lost by bad CPU architectures.” … “We’re not going to worry about whether we can compile it into a von Neumann computer or not, and we will make the microcode do whatever we need to get around these inefficiencies because a lot of the inefficiencies are just putting stuff on obsolete hardware architectures.”

Jamie Zawinski, an author of Mozilla: “In a large application, a good garbage collector is more efficient than malloc/free.” … “Don’t blame the concept of GC just because you’ve never seen a good GC that interfaces well with your favorite language.” Elsewhere: “it’s a misconception that lisp is, by its nature, slow, or even slower than C” … “if you’re doing a *big* project in C or C++, well, you’re going to end up reinventing most of the lisp runtime anyway”

Steve Yegge, a great tech blogger: “The von Neumann machine is a convenient, cost-effective, 1950s realization of a Turing Machine, which is a famous abstract model for performing computations.” … “There are various other kinds of computers, such as convenient realizations of neural networks or cellular automata, but they’re nowhere as popular either, at least not yet”. And… “The Von Neumann architecture is not the only one out there, nor is it going to last much longer (in the grand 400-year scheme of things.)”

Wow. Sounds dazzling and mind-opening, doesn’t it? Except there isn’t any technical detail whatsoever. I mean, it’s important to be open-minded and stuff. It really is. The fact that something doesn’t seem “practical” doesn’t mean you shouldn’t think or talk about it. But if something isn’t even a something, just a vague idea about Awesome Coolness, it poisons the readers’ minds, people. It’s like talking about Spirituality of the kind that lets you jump over cliffs at your mighty will or something (I’m not that good at New Age, but I think they have things like these in stock). This can only lead to three results:

  1. Your reader ignores you.
  2. Your reader sits on a couch and waits to gain enough Spirituality to jump around cliffs. Congratulations! Your writing has got you one fat fanboy.
  3. Your reader assumes he’s Spiritual enough already and jumps off a cliff, so you’ve got a slim fanboy corpse.

It’s the same with this Great High-Level Hardware talk. I can ignore it, or I can wait forever until it emerges, or I can miserably fail trying to do it myself. Seriously, let’s look at these claims a little closer.

Alan Kay mentions a benchmark showing how lame our CPUs are. I’d really like to see that benchmark. Because I’ve checked out the B5000 which he praised in that article. And I don’t think a modern implementation of that architecture would beat a modern CPU in terms of raw efficiency. You see, RISC happened for a reason. Very roughly, it’s like this:

  • You can access memories at single cycle throughput.
  • You can process operands in registers at single cycle throughput.
  • And that’s pretty much what you can do.

Suppose you want to support strings and have a string comparison instruction. You might think that “it’s done in the hardware”, so it’s blindingly fast. It isn’t, because the hardware still has to access memory, one word per cycle. A superscalar/VLIW assembly loop would run just as quickly; the only thing you’d save is a few bytes for instruction encoding. On the other hand, your string comparison thingie has got you into several sorts of trouble:

  • Your machine is larger, with little gain - you don’t compare strings most of the time.
  • Your machine is complicated, so optimizing the hardware is trickier.
  • Compilers have trouble actually utilizing your instructions.
  • Especially as the underlying hardware implementation grows more complicated and the performance of assembly code gets harder to model.
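
The memory-bound argument about string comparison can be sketched with a toy load counter. This is a minimal model, not a description of any real machine: the 4-byte word size and the function name `compare_counting_loads` are assumptions made up for the example.

```python
WORD_BYTES = 4  # assumed word size, picked for illustration


def compare_counting_loads(a: bytes, b: bytes):
    """Compare two byte strings word by word; return (equal, word_loads)."""
    loads = 0
    if len(a) != len(b):
        return False, loads
    for i in range(0, len(a), WORD_BYTES):
        word_a = a[i:i + WORD_BYTES]  # one word load from the first string
        word_b = b[i:i + WORD_BYTES]  # one word load from the second string
        loads += 2
        if word_a != word_b:
            return False, loads  # stop at the first mismatching word
    return True, loads


equal, loads = compare_counting_loads(b"x" * 64, b"x" * 64)
print(equal, loads)
```

Whoever runs this loop, microcode or a compiler-generated instruction sequence, has to issue the same word loads. At one word per cycle, the load count is a lower bound on the cycle count either way, which is why the dedicated instruction buys so little.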

When people were supposed to write assembly programs, the inclusion of complicated high-level instructions was somewhat natural. When it became clear that compilers write most of the programs (because compilation became cheap enough), processors became less high-level; the points above hopefully explain why.

And don’t get me started about the tagging of data words. B5000 had polymorphic microcode - it would load two words, look at their type bits and add them according to the run time types. Well, B5000 didn’t support things like unsigned 8-bit integers, which happen to be something I need, because that’s how you store images, for example. Am I supposed to carry tag bits in each frigging byte? Let me point out that it has its cost. And I don’t think this sort of low-level polymorphism makes much of a dent in the cost of Lisp or Smalltalk-style dynamic binding, either (B5000 was designed to run Algol; what would you do to run Smalltalk?)
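
Here is a rough sketch of both halves of the tagging argument: what a tag-dispatched add looks like, and what tags cost in storage once your data elements get small. The tag names, the 3-bit tag width and the 48-bit "wide word" are illustrative assumptions, not a description of the actual B5000 formats.

```python
def tagged_add(a, b):
    """a and b are (tag, value) pairs; the tags pick the operation at run time."""
    tag_a, val_a = a
    tag_b, val_b = b
    if tag_a == "int" and tag_b == "int":
        return ("int", val_a + val_b)
    # Mixed or floating point operands: fall back to a floating point add.
    return ("float", float(val_a) + float(val_b))


def storage_overhead(data_bits, tag_bits):
    """Fractional storage growth when every data element carries tag_bits."""
    return tag_bits / data_bits


print(tagged_add(("int", 2), ("int", 3)))
print(tagged_add(("int", 2), ("float", 0.5)))
print(storage_overhead(48, 3))  # hypothetical wide word
print(storage_overhead(8, 3))   # the same tag on every image byte
```

Under these made-up numbers, a tag that costs about 6% on a wide word costs 37.5% on an 8-bit pixel, and that is before counting the wider data paths needed to carry it around.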

There’s another angle to it: Alan Kay mentions that you almost couldn’t crash the B5000, which suited the business apps it was supposed to run quite well. I think that’s just awesome, I really do (I shoveled through lots of core dumps). In fact, I think people who implemented modern desktop operating systems and web browsers in unsafe languages on top of unsafe hardware are directly responsible for the vast majority of actual security problems out there. But (1) in many systems, the performance is really really important and (2) I think that security in software, the way it’s done in JVM or .NET, still has lower overall cost than tagging every byte (I’m less sure about part 2 because I don’t really know the guts of those VMs). Anyway, I think that hardware-enforced safety is costly, and you ought to acknowledge it (or really show why this fairly intuitive assumption is wrong, that is, delve into the details).

JWZ’s Lisp-can-be-efficient-on-stock-hardware claim isn’t much better than Smalltalk-can-be-efficient-on-custom-hardware, I find. Just how can it be? If you use Lisp’s static annotation system, your code becomes uglier than Java, and much less safe (I don’t think Lisp does static checking of parameter types, it just goes ahead and passes you an object and lets you think it’s an integer). If you use Lisp in the Lispy way that makes it so attractive in the first place, how on Earth can you optimize out the dynamic type checks and binding? You’d have to solve undecidable problems to make sense of the data flow. “A large project in C would implement the Lisp run time?” Oh really? You mean each variable will have the type LispObject (or PyObject or whatever)? Never happens, unless the C code is written by a deeply disturbed Lisp weenie (gcc and especially BetaPlayer, I’m talking about you). The fact that some people write C code as if they were a Lisp back-end is their personal problem, nothing more, nothing less.

The dynamic memory allocation business is no picnic, either. I won’t argue that garbage collection is significantly less efficient than manual malloc/free calls, because I’m not so sure about it. What I will argue is that a good Lisp program will use much more dynamic allocation and indirection levels than a good C program (again, I ignore the case of emulating C in Lisp, or Lisp in C, because I think it’s a waste of time anyway). And if you want to make your objects flat, I think you need a static type system, so you won’t be much higher-level than Java in terms of dynamic flexibility. And levels of indirection are extremely costly because every data-dependent memory access is awfully likely to introduce pipeline stalls.
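
The cost of indirection can be sketched by counting loads in two layouts of the same data. This is a toy model under stated assumptions: the `heap` dictionary stands in for memory, its keys for arbitrary addresses, and the function names are invented for the example.

```python
# Flat layout: the values themselves sit next to each other.
flat = [10, 20, 30, 40]

# Boxed layout: the array holds "pointers" into a heap of separate objects.
heap = {100: 10, 205: 20, 17: 30, 342: 40}
boxed = [100, 205, 17, 342]


def sum_flat(values):
    """Return (total, loads). Each address is known in advance: base + i."""
    total, loads = 0, 0
    for i in range(len(values)):
        total += values[i]
        loads += 1
    return total, loads


def sum_boxed(pointers, memory):
    """Return (total, loads, dependent_loads)."""
    total, loads, dependent = 0, 0, 0
    for i in range(len(pointers)):
        address = pointers[i]    # first load: fetch the pointer
        total += memory[address]  # second load: can't start before the first ends
        loads += 2
        dependent += 1
    return total, loads, dependent


print(sum_flat(flat))
print(sum_boxed(boxed, heap))
```

The boxed version does twice the loads, but the count isn't the worst part. In the flat case all addresses can be computed ahead of time, so the loads can be pipelined; in the boxed case every second load has to wait for the data returned by the one before it.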

Pure functional languages with static typing have their own problem - they lack side effects and make lots of copies at the interface level; eliminating those copies is left as an exercise to the compiler writer. I’ve never worked through a significant array of such exercises, so I won’t argue about the problems of that. I’ll just mention that static typing (regardless of the type inference technique) characterizes lower-level languages, because now I have to think about types, just the way imperative programming is lower-level than functional programming, because now I have to think about the order of side effects. You can tell me that I don’t know what “high-level” means; I won’t care.

Now, the von Neumann machine business. Do you realize the extent to which memory arrays are optimized and standardized today? It’s nowhere near what happens with CPUs. There are lots of CPU families running lots of different instruction sets. All memories just load and store. Both static RAM (the expensive and fast kind) and dynamic RAM (the cheap and slower kind) are optimized to death, from raw performance to factory testing needed to detect manufacturing defects. You don’t think about memories when you design hardware, just the way you don’t think about the kind of floating point you want to use in your numeric app - you go for IEEE because so much intellectual effort was invested in it on all levels to make it work well.

But let’s go with the New Age flow of “von Neumann machine is a relic from the fifties”. What kinds of other architectures are there, and how do you program them, may I ask? “C is for von Neumann machines”. Well, so is Java and so is Lisp; all have contiguous arrays. Linked lists and dictionaries aren’t designed for any other kind of machine, either; in fact lots of standard big O complexity analysis assumes a von Neumann machine - O(1) random access.

And suppose you’re willing to drop standard memories and standard programming languages and standard complexity analysis. I don’t think you’re a crackpot, I really don’t; I think you’re bluffing, most probably, but you could be a brilliant and creative individual. I sincerely don’t think that anything practiced by millions can automatically be considered “true” or “right”; I was born in the Soviet Union, so I know all about it. Anyway, I want to hear your ideas. I have images. I must process those images and find stuff in them. I need to write a program and control its behavior. You know, the usual edit-run-debug-swear cycle. What model do you propose to use? Don’t just say “neural nets”. Let’s hear some details about hardware architecture.

I really want to know. I assume that an opinion held by quite a few celebrities is shared by lots and lots of people out there. Many of you are competent programmers, some stronger than myself. Tell me why I’m wrong. I’ll send you a sample chip. I’ll publicly admit I was a smug misinformed dumbass. Whatever you like. I want to close this part of “efficient/high-level isn’t a trade-off” nonsense, so that I can go back to my scheduled ranting about the other part. You know, when I poke fun at C++ programmers who think STL is “high-level” (ha!). But as long as this “Lisp is efficient” (ha!) issue lingers, I just can’t go on ranting with a clear conscience. Unbalanced ranting is evil, don’t you think?