The "high-level CPU" challenge

Do you love ("very") high-level languages? Like Lisp, Smalltalk, Python, Ruby? Or maybe Haskell, ML? I love high-level languages.

Do you think high-level languages would run fast if the stock hardware weren't "brain-damaged"/"built to run C"/"a von Neumann machine (instead of some other wonderful thing)"? You do think so? I have a challenge for you. I bet you'll be interested.

Background:

  • I work on the definition of custom instruction set processors (just finished one).
  • It's fairly high-end stuff (hundreds of MHz, transistor counts in the hundreds of millions).
  • I also work on the related programming languages (compilers, etc.).
  • Whenever application programmers have to deal with low-level issues of the machine I'm (partly) responsible for, I feel genuine shame. They should be doing their job; the machine details are my job. Feels like failure (even if "the state of the art" isn't any better).
  • …But, I'm also obsessed with performance. Because the apps which run on top of my stuff are ever-hungry, number-crunching real time monsters. Online computer vision. Loads of fun, and loads of processing that would make a "classic" DSP hacker's eyeballs pop out of his skull.

My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don't you actually tell us about it, instead of briefly mentioning it? Requirement: your architecture should allow HLL code to run much faster than what a compiler emitting something like RISC instructions can deliver, without significant physical size penalties. In other words, if I have so many square millimeters of silicon, and I fill it with your cores instead of, say, MIPS cores, I'll be able to implement my apps in a much more high-level fashion without losing much performance (25% sounds like a reasonable upper bound). Bonus points for intrinsic support for vectorized low-precision computations.

If your architecture meets these requirements, I'll consider a physical implementation very seriously (because we could use that kind of thing), and if it works out, you'll get a chip so you can show people what your ideas look like. I can't promise anything, because, as usual, there are more forces at play than the theoretical technical virtue of an idea. I can only promise to publicly announce that your idea was awesome and I'd love to implement it; not much, but it's the best I can deliver.

If you can't think of anything, then your consistent assertions about "stupid hardware" are a stupid bluff. Do us a favor and shut up. WARNING: I can't do hardware myself, but there are lots of brilliant hardware hackers around me, and I've seen how actual chips are made and what your constraints are. Don't bullshit me, buddy.

Seriously, I'm sick and tired of HLL weenie trash talk. Especially when it comes from apparently credible and exceedingly competent people.

Alan Kay, the inventor of Smalltalk: "Just as an aside, to give you an interesting benchmark—on roughly the same system, roughly optimized the same way, a benchmark from 1979 at Xerox PARC runs only 50 times faster today. Moore’s law has given us somewhere between 40,000 and 60,000 times improvement in that time. So there’s approximately a factor of 1,000 in efficiency that has been lost by bad CPU architectures." … "We’re not going to worry about whether we can compile it into a von Neumann computer or not, and we will make the microcode do whatever we need to get around these inefficiencies because a lot of the inefficiencies are just putting stuff on obsolete hardware architectures."

Jamie Zawinski, an author of Mozilla: "In a large application, a good garbage collector is more efficient than malloc/free." … "Don't blame the concept of GC just because you've never seen a good GC that interfaces well with your favorite language." Elsewhere: "it's a misconception that lisp is, by its nature, slow, or even slower than C" … "if you're doing a *big* project in C or C++, well, you're going to end up reinventing most of the lisp runtime anyway"

Steve Yegge, a great tech blogger: "The von Neumann machine is a convenient, cost-effective, 1950s realization of a Turing Machine, which is a famous abstract model for performing computations." … "There are various other kinds of computers, such as convenient realizations of neural networks or cellular automata, but they're nowhere as popular either, at least not yet". And… "The Von Neumann architecture is not the only one out there, nor is it going to last much longer (in the grand 400-year scheme of things.)"

Wow. Sounds dazzling and mind-opening, doesn't it? Except there isn't any technical detail whatsoever. I mean, it's important to be open-minded and stuff. It really is. The fact that something doesn't seem "practical" doesn't mean you shouldn't think or talk about it. But if something isn't even a something, just a vague idea about Awesome Coolness, it poisons the readers' minds, people. It's like talking about Spirituality of the kind that lets you jump over cliffs at your mighty will or something (I'm not that good at New Age, but I think they have things like these in stock). This can only lead to three results:

  1. Your reader ignores you.
  2. Your reader sits on a couch and waits to gain enough Spirituality to jump around cliffs. Congratulations! Your writing has got you one fat fanboy.
  3. Your reader assumes he's Spiritual enough already and jumps off a cliff, so you've got a slim fanboy corpse.

It's the same with this Great High-Level Hardware talk. I can ignore it, or I can wait forever until it emerges, or I can miserably fail trying to do it myself. Seriously, let's look at these claims a little closer.

Alan Kay mentions a benchmark showing how lame our CPUs are. I'd really like to see that benchmark. Because I've checked out the B5000 which he praised in that article. And I don't think a modern implementation of that architecture would beat a modern CPU in terms of raw efficiency. You see, RISC happened for a reason. Very roughly, it's like this:

  • You can access memories at single cycle throughput.
  • You can process operands in registers at single cycle throughput.
  • And that's pretty much what you can do.

Suppose you want to support strings and have a string comparison instruction. You might think that "it's done in the hardware", so it's blindingly fast. It isn't, because the hardware still has to access memory, one word per cycle. A superscalar/VLIW assembly loop would run just as quickly; the only thing you'd save is a few bytes for instruction encoding. On the other hand, your string comparison thingie has got you into several sorts of trouble:

  • Your machine is larger, with little gain - you don't compare strings most of the time.
  • Your machine is complicated, so optimizing the hardware is trickier.
  • Compilers have trouble actually utilizing your instructions.
  • Especially as the underlying hardware implementation grows more complicated and the performance of assembly code gets harder to model.
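To make that concrete, here is the comparison loop in plain portable C; a hypothetical STRCMP instruction would have to issue the same one-word-per-cycle loads, so all it could save is a few instruction bytes:

```c
/* Byte-at-a-time string comparison. Whether this runs as one CISC
   instruction or as a short RISC loop, the memory port sees the same
   stream of loads - that stream, not instruction decode, is the
   bottleneck. */
static int str_cmp(const char *a, const char *b)
{
    while (*a && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}
```

(A real implementation would compare a word at a time, but the load stream, and hence the cycle count, stays essentially the same.)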

When people were supposed to write assembly programs, the inclusion of complicated high-level instructions was somewhat natural. When it became clear that compilers write most of the programs (because compilation became cheap enough), processors became less high-level; the points above hopefully explain why.

And don't get me started about the tagging of data words. B5000 had polymorphic microcode - it would load two words, look at their type bits and add them according to the run time types. Well, B5000 didn't support things like unsigned 8-bit integers, which happen to be something I need, because that's how you store images, for example. Am I supposed to carry tag bits in each frigging byte? Let me point out that it has its cost. And I don't think this sort of low-level polymorphism dwarfs the cost of Lisp or Smalltalk-style dynamic binding, either (B5000 was designed to run Algol; what would you do to run Smalltalk?)
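To put a rough number on that cost, here's a back-of-the-envelope sketch in C. The tagged_pixel layout is invented for illustration; real tag bits would be narrower than a byte, but a byte is the smallest addressable unit, so per-byte tags effectively double your image storage:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical layout: one tag byte per 8-bit pixel. */
typedef struct { uint8_t tag; uint8_t value; } tagged_pixel;

/* Bytes needed to store a w x h image of 8-bit pixels, raw vs. tagged. */
static size_t image_bytes_raw(size_t w, size_t h)
{
    return w * h;                         /* one byte per pixel */
}

static size_t image_bytes_tagged(size_t w, size_t h)
{
    return w * h * sizeof(tagged_pixel);  /* two bytes per pixel */
}
```

For a 640x480 frame that's 307,200 bytes raw versus 614,400 tagged - double the memory and double the bandwidth, before you even account for the tag-checking logic itself.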

There's another angle to it: Alan Kay mentions that you almost couldn't crash the B5000, which suited the business apps it was supposed to run quite well. I think that's just awesome, I really do (I shoveled through lots of core dumps). In fact, I think people who implemented modern desktop operating systems and web browsers in unsafe languages on top of unsafe hardware are directly responsible for the vast majority of actual security problems out there. But (1) in many systems, the performance is really really important and (2) I think that security in software, the way it's done in JVM or .NET, still has lower overall cost than tagging every byte (I'm less sure about part 2 because I don't really know the guts of those VMs). Anyway, I think that hardware-enforced safety is costly, and you ought to acknowledge it (or really show why this fairly intuitive assumption is wrong, that is, delve into the details).

JWZ's Lisp-can-be-efficient-on-stock-hardware claim isn't much better than Smalltalk-can-be-efficient-on-custom-hardware, I find. Just how can it be? If you use Lisp's static annotation system, your code becomes uglier than Java, and much less safe (I don't think Lisp does static checking of parameter types, it just goes ahead and passes you an object and lets you think it's an integer). If you use Lisp in the Lispy way that makes it so attractive in the first place, how on Earth can you optimize out the dynamic type checks and binding? You'd have to solve undecidable problems to make sense of the data flow. "A large project in C would implement the Lisp run time?" Oh really? You mean each variable will have the type LispObject (or PyObject or whatever)? Never happens, unless the C code is written by a deeply disturbed Lisp weenie (gcc and especially BetaPlayer, I'm talking about you). The fact that some people write C code as if they were a Lisp back-end is their personal problem, nothing more, nothing less.

The dynamic memory allocation business is no picnic, either. I won't argue that garbage collection is significantly less efficient than manual malloc/free calls, because I'm not so sure about it. What I will argue is that a good Lisp program will use much more dynamic allocation and indirection levels than a good C program (again, I ignore the case of emulating C in Lisp, or Lisp in C, because I think it's a waste of time anyway). And if you want to make your objects flat, I think you need a static type system, so you won't be much higher-level than Java in terms of dynamic flexibility. And levels of indirection are extremely costly because every data-dependent memory access is awfully likely to introduce pipeline stalls.
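A minimal C illustration of the flatness point (the names are mine, not from any particular runtime): both functions compute the same sum, but the boxed version performs a data-dependent load per element, which is exactly where the stalls come from:

```c
typedef struct { double x, y; } point;

/* Flat: points stored contiguously; the addresses of future loads
   are predictable, so the prefetcher can keep the pipeline fed. */
static double sum_flat(const point *pts, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += pts[i].x + pts[i].y;
    return s;
}

/* Boxed: one pointer dereference per element; each element's address
   is only known after a load completes - a recipe for stalls. */
static double sum_boxed(point *const *pts, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += pts[i]->x + pts[i]->y;
    return s;
}
```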

Pure functional languages with static typing have their own problem - they lack side effects and make lots of copies at the interface level; eliminating those copies is left as an exercise to the compiler writer. I've never worked through a significant array of such exercises, so I won't argue about how hard they are. I'll just mention that static typing (regardless of the type inference technique) characterizes lower-level languages, because now I have to think about types, just the way imperative programming is lower-level than functional programming, because now I have to think about the order of side effects. You can tell me that I don't know what "high-level" means; I won't care.

Now, the von Neumann machine business. Do you realize the extent to which memory arrays are optimized and standardized today? It's nowhere near what happens with CPUs. There are lots of CPU families running lots of different instruction sets. All memories just load and store. Both static RAM (the expensive and fast kind) and dynamic RAM (the cheap and slower kind) are optimized to death, from raw performance to factory testing needed to detect manufacturing defects. You don't think about memories when you design hardware, just the way you don't think about the kind of floating point you want to use in your numeric app - you go for IEEE because so much intellectual effort was invested in it on all levels to make it work well.

But let's go with the New Age flow of "von Neumann machine is a relic from the fifties". What kinds of other architectures are there, and how do you program them, may I ask? "C is for von Neumann machines". Well, so is Java and so is Lisp; all have contiguous arrays. Linked lists and dictionaries aren't designed for any other kind of machine, either; in fact lots of standard big O complexity analysis assumes a von Neumann machine - O(1) random access.

And suppose you're willing to drop standard memories and standard programming languages and standard complexity analysis. I don't think you're a crackpot, I really don't; I think you're bluffing, most probably, but you could be a brilliant and creative individual. I sincerely don't think that anything practiced by millions can automatically be considered "true" or "right"; I was born in the Soviet Union, so I know all about it. Anyway, I want to hear your ideas. I have images. I must process those images and find stuff in them. I need to write a program and control its behavior. You know, the usual edit-run-debug-swear cycle. What model do you propose to use? Don't just say "neural nets". Let's hear some details about hardware architecture.

I really want to know. I assume that an opinion held by quite some celebrities is shared by lots and lots of people out there. Many of you are competent programmers, some stronger than myself. Tell me why I'm wrong. I'll send you a sample chip. I'll publicly admit I was a smug misinformed dumbass. Whatever you like. I want to close this part of "efficient/high-level isn't a trade-off" nonsense, so that I can go back to my scheduled ranting about the other part. You know, when I poke fun at C++ programmers who think STL is "high-level" (ha!). But until this "Lisp is efficient" (ha!) issue lingers, I just can't go on ranting with clear conscience. Unbalanced ranting is evil, don't you think?

66 comments

#1 alfiejohn on 01.30.08 at 11:49 am

How about trying to implement Parrot's VM?

http://www.parrotcode.org/

#2 Yossi Kreinin on 01.30.08 at 12:04 pm

Exactly: /how/ should I go about implementing Parrot's VM? :) And why would it be more efficient than running it in software on a standard RISC core hooked to standard RAM?

I'm not saying there are no good VMs for HLLs, just that implementing them in hardware wouldn't increase the overall system efficiency (performance/price).

#3 szetlan on 01.30.08 at 12:31 pm

How about adding processor-level instructions for regular expression matching, or for higher-level operations used in regular expression matching? I'm not a processor guy, so I don't know if that's just a pipe dream, but wouldn't it be faster to run almost any text-processing language (think perl, python, ruby) if fewer instructions are required to perform the comparisons? This might be impossible, but then again … maybe it's just really, really difficult.

#4 queisser on 01.30.08 at 12:38 pm

What about FPGAs? We run a soft core on a cheap FPGA at low MHz but get lots of acceleration from "coprocessor" cores that were compiled from C code. Instead of developing custom chips for one specific HLL maybe we'd like some FPGA coprocessors that can be loaded up with task-specific logic as needed.

You mention image processing - I remember using image processing boards in the late 80s that had FPGA front ends because the PCs were just too slow. Maybe it's time to resurrect that approach.

#5 Justin.Wick on 01.30.08 at 12:42 pm

First of all, if one's going to try to implement a virtual machine (seems that would defeat the purpose of it being a VM, but hey whatever works), something like JVM or LLVM would likely be much more productive. Of course you're left with the problem that they are essentially just slightly higher level than a normal CPU, nothing drastic. And Yossi is right, the trick is - how would you even do this?

My question is this - there was once a time when memory hierarchies were shallow enough that many if not most programs were CPU bound. Except in tightly pipelined systems, cryptography, and a few cases where datasets are so small that they fit into on-chip cache, this no longer seems to be the case. For that reason, I cannot possibly understand what performance benefit we'd see from implementing a higher level instruction set - unless it prevents us from accessing memory in a cache-incoherent manner. Anyone have thoughts on this?

I think Yossi would agree that it might be nice to have a standard CPU architecture that was easier to generate efficient code for (x86 is a bitch, and its intended successor EPIC was so much harder that the scientists I worked with had to code entire chunks of programs in assembly, or easily suffer 300-400% performance hits). Indeed, if anything modern compiling has taught us, it's that reducing code size (and therefore instruction cache misses) is a big deal. Perhaps an architecture that had operands which map to higher level (but still fundamentally "C-like") constructs, and then compiled them down to internal RISC opcodes, might burn some silicon to save IC misses, as a form of code compression.

We already live in an age where (for some applications) dynamically compiled languages like Java can be *faster* than statically compiled C (if you don't code like a slob, and have twice the RAM you'd need in C) due to runtime analysis, hotspot optimization, etc. CPU cores idling, just waiting for something to do while they fetch from RAM. If anything, what we need is not a better CPU, we need libraries, languages, and compilers that support dynamic program transformations to encourage cache-coherency. Get the data accesses in your LISP or Java program clustered appropriately, and you'll see significant gains. Move away from the "arrays are contiguous blocks of RAM I index directly" paradigm of C, into something more along the lines of "arrays are fast O(1) containers for things I index via integer keys, which may be transformed at run-time depending on the optimized internal structure."
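The hand-made version of that kind of layout transformation is the classic array-of-structs to struct-of-arrays rewrite - a C sketch with invented field names, same integer-indexed interface, different internal structure:

```c
enum { N = 1024 };

/* Array of structs: each pixel's fields are adjacent, so a loop that
   touches only r still drags g and b through the cache. */
struct pixel_aos { unsigned char r, g, b; };

/* Struct of arrays: same logical container, same index, but a loop
   over r now walks a dense array - much better locality. */
struct pixels_soa { unsigned char r[N], g[N], b[N]; };

static long sum_r_aos(const struct pixel_aos *p, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += p[i].r;
    return s;
}

static long sum_r_soa(const struct pixels_soa *p, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += p->r[i];
    return s;
}
```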

#6 twp on 01.30.08 at 1:00 pm

http://www-03.ibm.com/systems/z/zaap/

(Specialty hardware for running Java?)

Does that count?

I have no idea what the zaap does - but that's almost what it appears to be..

#7 betaray on 01.30.08 at 2:16 pm

I'd agree that high level languages certainly have their cost on von Neumann machines, but you're obviously failing Yegge's test for common sense. You're forgetting that we can make computers out of things other than switches. There have been general purpose analog neural networks like the Intel ETANN and Bellcore CLNN32/64. There have been digitally implemented neural networks like the IBM ZISC036. In fact I believe the ZISC036 meets all the requirements of your challenge. (See: http://www.it.hiof.no/prosjekter/hoit/html/nr1_96/zisc036.html)

If you are arguing that the von Neumann machine is likely the best computer you can build based on switches and current knowledge, then I'd agree.

Efficiency is a wide field. Efficiency in what dimension, for what problem? There have been situations where DNA and enzyme based computers were able to solve problems 100,000 times faster than a state-of-the-art von Neumann machine.

Transistors, as you point out, have had a lot of effort put into their optimization, but I think Yegge's whole point was that there exist technologies that, when optimized, will far exceed the capabilities of even the best switch.

#8 bjorn.lindberg on 01.30.08 at 2:30 pm

Great article, I'll get some beer and wait …

#9 dbrunton on 01.30.08 at 2:50 pm

Okay, I'll bite.

Here's my submission: http://www.davidbrunton.com/2008/01/submission-to-high-level-cpu-challenge.html

Now I'll sit back, also with a beer, and wait for my free chip ;)

#10 stevedekorte on 01.30.08 at 3:18 pm

As someone who has written a high level language, the major expense I've seen is for hash table lookups (for dynamic binding) which take many cycles. I've been told that associative memory hardware could reduce these lookups to a single cycle - is this true?

#11 sdhawley on 01.30.08 at 3:22 pm

Interesting post. What about Lisp machines? http://en.wikipedia.org/wiki/Lisp_machine

As I understand it they were special purpose hardware to run Lisp programs faster than the general purpose hardware of the day.

#12 Paul Stansifer on 01.30.08 at 6:49 pm

Since I like the idea of side-effect-free HLLs, I'd like to see hardware support for reference-counting, because side-effect-free languages cannot create cyclic data structures. As far as I can tell, very little research has gone into making reference-counting faster (probably because people have always assumed that we would need to collect circular structures), but it seems like an ideal problem for hardware to solve. Assuming that the first field of an object is its reference count, you'd have instructions like "Copy and increment referent" and "Decrement referent and jump if zero". Much of the slowness of side-effect-free work could also be eliminated by not copying objects if their reference count is about to become zero (a common case when simulating side-effects — the optimization turns the simulated side-effect into a real one).

As I understand it, one of the problems with reference counting is that it writes to memory a whole lot. An associative cache for reference counts could alleviate that. Other reference counting tricks, like one-bit reference counting or weighted reference counts, could become much faster with hardware support.
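In software, the two primitives would behave like this C sketch (following the assumption above that the first field of every object is its reference count; all names are invented):

```c
/* Minimal object header: reference count first, payload follows. */
typedef struct obj { long refcount; /* ... payload ... */ } obj;

/* "Copy and increment referent": duplicate a reference, bump the count. */
static obj *ref_copy(obj *o)
{
    o->refcount++;
    return o;
}

/* "Decrement referent and jump if zero": in C the jump becomes a
   return value the caller branches on to run the destructor. */
static int ref_drop(obj *o)
{
    return --o->refcount == 0;
}
```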

While I'm dreaming, it'd also be nice to see better tools for lightweight parallelism. All the silicon that's spent on deep pipelines and sophisticated branch prediction and register scoreboarding could go to increasing the number of cores, provided compilers took advantage of them. I have to admit that I don't have a clear idea of what the instructions for ultra-lightweight parallelism would look like, but the potential usefulness is limited only by the amount of time overhead needed to ascertain the number of free cores and start them working on a slice of the data. The compiler can ensure that they don't need to touch the stack for the duration of the data-level parallelism.

Specialized cores that can only add, subtract, and multiply integers would be exceptionally useful, because lots of work just doesn't involve floating-point math. (http://bloggablea.wordpress.com/2007/04/27/so-does-anyone-even-use-all-these-darn-cpu-instructions/)

Alternately, imagine a set of cores, each with a dedicated bus connected to dedicated memory (in addition to main memory). The memory could be addressed such that the cores are interleaved at the word level. One could drop an array into that area of memory and have all the cores operate on it in parallel. I feel like I'm describing something that already exists, though.

I feel that creating hybrid processors, based on von Neumann machines, but with special optimizations for Haskell-like languages, will produce the fastest machines for HLLs of the future. Our universe is very skewed towards the von Neumann model, and sometimes our problems map to it nicely, but sometimes they are much more like lambda calculus.

#13 RixiM on 01.30.08 at 7:36 pm

dbrunton and stevedekorte beat me :(
(oh, and szetlan is saying the same thing steve is maybe without realizing it…)
ideally, the answer is cellular automata, i read someplace that the reason von neumann used unified memory was to facilitate code modification in automata… i wonder what he'd think about modern computer viruses…

i think "associative memory hardware" gets you more than half way to solving the HLL challenge and is a lot more practical than a cellular automata hardware implementation (as i understand it, current chip fabrication and wiring make 3D matrices pretty cost ineffective.)

if one is going to go through the trouble of a hardware neural network implementation, (which doesn't actually solve the challenge at all,) they might as well just go the full monty and implement cellular automata, the two aren't really that much different; i think cellular automata are technically more space efficient.

FPGAs aren't really an answer; they are arrays of one bit adders, as many others before me have pointed out. in other words very stupid, very slow, 2D cellular automata.

so, in some regards i think new fabrication techniques and the old standby-never-quite-implemented-efficiently associative memories and cellular automata will be the "solution"

incidentally, phase change memory is well suited for all of the above things… as well as other existing 3D fab techniques.

#14 rektide on 01.30.08 at 9:24 pm

I like the Niagara 2. The onboard ethernet has a chunk of silicon that lets it queue up events on different CPUs depending on different filtering criteria, that's extremely smart when you're using 8 massively wide cores as stream processors/request processors off some 10G feeds. To some degree it turns the computer into a multicomp, and lets each cpu keep its own memory (aka avoid shared memory). Add in a crypto processor capable of saturating the 10G link and you can replace half your networking gear. Going back to the massively wide architecture and using something like HT is very smart as well, basically providing a bunch of functional units and then trying to keep them all fed. If the question is "how do you scale out," Sun has some very good answers, and most of them appear on die.

Bonus q: how do you feed 8 extra-wide cores? Four dual-channel banks of FB-DIMMs. Nice.

The spaghetti stack hardware architecture of the Burroughs matches closely to continuations, which we're seeing more and more of for implementing async event handling routines. This would potentially be excellent for dealing with highly concurrent systems.

I'm no expert on the Burroughs, but I don't think the Burroughs was particularly a CISC vs. RISC fight, no more than the x86's CS, DS, SS, ES; you could implement most of its architecture in RISC form. It's just that the "segment registers" notion was taken a little higher level with the Burroughs, to let the processor know a little more about the stack and to let the stack grow in different ways. The most glaring counterclaim would be having to chase down the stack and descend lexical levels to dig up another variable, but there's plenty of caching opportunities here; x86 adds its own TLBs all the time to deal with the same class of problems for segmentation and indirect lookups. Who knows, it's probably a terrible idea, I certainly don't know enough about Burroughs-like systems to judge, but I think it would be really fascinating to try and make an object oriented cpu or a spaghetti stack cpu or something other than the segmented stack machine. (PPC has some interesting potential with its inverted page tables)

#15 trh on 01.30.08 at 9:41 pm

Although it would still essentially be a von Neumann computer, I would really love to see a CPU which is a graph reducer and implemented SK combinators.

Has this been done before?

It would need to support GC explicitly, and some other memory and math operations, but implementing a functional language on this sort of architecture would be a breeze, obviously.

#16 Brendan on 01.31.08 at 11:02 am

My understanding is that when the JVM needs a chunk of blank memory the system carefully loads the previous contents of the memory to cache, which the VM then clobbers. A design of cache/memory architecture that eliminated the redundant read would improve performance.

#17 slam4982 on 02.01.08 at 9:12 am

Not my idea, though I wish I could take credit for it.

It's called the Reduceron2.
http://www-users.cs.york.ac.uk/~mfn/reduceron2/

"The Reduceron is an extremely simple machine, containing just 4 instructions, and executes core Haskell almost directly." <- from the website.

Pretty cool, does this meet the criteria for someone (though not me) actually doing something instead of just talking about it?

#18 Yossi Kreinin on 02.01.08 at 9:24 am

Of course it does! Reduceron is a pretty cool project.

I'm in the process of chasing the various links people sent and digesting everything. I'll publish a summary of the interesting stuff when I'm done.

#19 Tiempo finito y logarítmico » Desempeño de arquitectura on 02.02.08 at 7:09 pm

[...] – Alan Kay, read at The “high-level CPU” challenge [...]

#20 xirium on 02.03.08 at 10:56 pm

More than 10 years ago, I saw a proposal for a processor with one ALU and 128 program counters and sets of registers. The intention was that it would be exceptionally good at implementing multi-threaded applications. Unfortunately, suitable software was quite lacking in that era. However, the theory is very good.

If you design a processor which is extremely multi-threaded then you do not need to implement deep pipelines, speculative execution, branch prediction, out-of-order execution (and reconciliation) or register shadowing.

High level languages are very problematic for pipelines. This is especially true when running object oriented languages. The problem is that method calls are dependent on the instance being handled. A tree or network of classes has some common methods. Additional methods may be lazily loaded during execution. This arrangement hinders compilation techniques. It also affects pipelining.

A common case is a method calling a method of another class. This process can involve multiple memory accesses. Each method call also requires a jump to an address which is unknown until very briefly before the jump is made. This is problematic for performance if you have a deep pipeline which decodes instructions over many clock cycles. It is possible to defer write operations to increase read throughput. This can reduce immediate impact but it increases complexity and creates additional opportunities for pipeline stalls. Furthermore, it doesn't change the fact that your ALU is idle between method calls.

Instead, it would be more productive to have a very large number of hardware threads. ALU, registers and instruction set could be designed with minimal complexity. It would be possible to implement eight simple ALUs, each with eight sets of registers. Therefore, you'd have eight threads per ALU. Threads within a group would be allocated processing time with strict preference. Each group is run out of phase.

Dedicated instruction caches are not required because instructions can be retrieved from secondary cache within eight cycles. If access to main memory takes 20 cycles then a thread will block and two lesser threads within its group can be run. Additional memory accesses will cause additional threads to block. Regardless, throughput from the cache should be optimal and one or more ALUs will be in use almost constantly. However, unlike contemporary designs, this remains true when running OO and bytecode programs.

I'd suggest implementing a two address machine. This allows a rich but compact set of instructions to be loaded, cached, transferred and decoded efficiently. I'd implement a subset of the complexity of a TMS9900 or a MC68000 because it would give the best throughput versus complexity. If you stripped the instructions which required microcode then the remainder would easily complete within eight cycles.

You'd probably want a symmetric design, for simplicity. Regardless, it would be possible to have differing ALUs, registers and instruction sets within one implementation. One or more ALUs could be optimised for DSP operations. One or more ALUs could be optimised for bit operations.

This idea works very efficiently with existing software. It doesn't require fancy compilers. It doesn't require exceptional bus bandwidth. I just wish it was mine.

#21 Yossi Kreinin on 02.04.08 at 1:59 pm

Hyper-threading (lots of reg sets per ALU) has its problems:

1. Some workloads are hard to parallelize
2. Some workloads are ALU-bound

Basically, AFAIK you hit diminishing returns around register set #4 or #5: the chances of increasing actual throughput aren't high enough to justify the extra die size.

Hyper-threading doesn't remove the cost of HLLs, although it can help fill the stalls in the pipeline and thus reduce this cost somewhat; however, it won't help in the (probably frequent) case where your vtables, or whatever it is that causes the indirection, are already in the cache (unless the core has a high load-from-cache latency, which is a problem by itself).

#22 DGentry on 05.08.08 at 8:34 pm

I read this article a couple weeks ago, and started writing a response. It grew large enough that I decided to post it as a separate article rather than in this tiny little comment window.

A summary of the proposal is a second TLB intended to be managed directly by application code, allowing cheap copy-on-write mappings for functional language argument passing.

http://codingrelic.geekhold.com/2008/05/high-level-cpu-response.html

#23 Chris Barts on 06.03.08 at 9:52 pm

You should update your article, because the Lisp type system does not work the way you think it does. SBCL, a commonly-used Lisp system, does refuse code that does not pass its type checking (both static and dynamic) and does compile away the checks on code it knows is safe (unboxing is done as well). As for Lisp ever being uglier than Java: Macros buy me so much in terms of expressiveness Java seems like a bad joke. That, plus the fact Java doesn't seem to have a workable type system, clinches it for me.

#24 Yossi Kreinin on 06.04.08 at 1:13 pm

Peter Seibel "should" update his book then. Quote:

"… to make Common Lisp compile ADD more or less like a C compiler would, you can write it like this:

(defun add (x y)
  (declare (optimize (speed 3) (safety 0)))
  (declare (fixnum x y))
  (the fixnum (+ x y)))

Of course, now the Lisp version suffers from many of the same liabilities as the C version–if the arguments passed aren't fixnums or if the addition overflows, the result will be mathematically incorrect or worse."

I guess it has something to do with the (safety 0) thingie. It has the advantage of yielding assembly code identical to that of a statically typed C version, but is apparently even less safe.

Can SBCL be better than what the CL standard apparently mandates implementations to be, specifically, be as fast as C due to explicit typing or type inference without sacrificing safety? I dunno. SBCL shootout benchmarks run ~2x slower than C. And regarding "beauty" compared to C/Java/… - (the fixnum (+ a b)) when a and b are already declared as fixnums is something you wouldn't have to do in C or Java. SBCL benchmark implementations also have their share of declare and convert calls. Not to mention the (declare (optimize…)) from the book.

That macros buy expressiveness doesn't contradict anything of what I said. I didn't diss Lisp, praise Java or claimed Lisp was linguistically inferior to Java (ha!). I claimed the following:

1. Implementing support for Lisp in hardware doesn't reduce system cost compared to an implementation on a "traditional" machine and an optimizing compiler.
2. Lisp intrinsically costs more than a low-level language like C; neither hardware nor software can hide it completely. It doesn't mean you should use low-level languages, it means just what I said.

Also, Lisp with all the declarations you'd make in C with all the limitations you'd have in C isn't Lisp with types, it's C with sexprs (I'd like to have C with sexprs, but that's another matter). That is hardly provable/refutable though; it's a personal preference/perception thing.

P.S. Java has a "workable" type system since quite a bunch of systems were delivered with it. It doesn't mean I want to work in Java; by the test above, C++ is "workable", and in case you didn't notice, I despise C++. If people get work done in it, it's "workable". But that's just a side note.

#26 Zarutian on 01.28.09 at 1:46 pm

Well, I think that Alan Kay and many others are actually pointing out bad CPU designs.

Two examples are branch prediction and memory caches.

I'd rather have a Cell Broadband Engine type of processor, where each core has private scratch memory and DMA between those and main memory. The cache coherency problem and the branch prediction penalty disappear, and less silicon real estate is taken up by cache circuitry. Add proper hardware-level subroutine handling (à la Forth's return stack), and the possibility of fitting more code into the constrained scratch memory gives you more functionality per kibibyte. Add register-file and return-stack DMAing and you get faster task switching. And if you could add an FPGA matrix as "one core" on the chip, that's a bonus too ;-)

#27 Ryan Ische on 01.28.09 at 2:48 pm

Reminds me of http://www.loper-os.org/

#28 mbishop on 01.28.09 at 4:13 pm

I remember seeing a presentation a while back about a multi-core system that had two stacks per core, and 60 cores (or something like that). It was sort of like Forth in hardware. I can't remember where I saw it, but it might be on Google Video or something if you want to search (I can't remember the guy's name…)

#29 Dave Roberts on 01.28.09 at 7:33 pm

You're correct that most features of a high-level language wouldn't be accelerated with special hardware. Specifically, type tagging and microcoding are a big lose. As another commenter pointed out, modern processing performance is memory-bound, not CPU-bound. If you can keep your computations "on-chip," then you're better off with a simple RISC implementing those calculations rather than special-purpose hardware. The only minor win for special-purpose hardware is that the RISC approach requires slightly more icache bandwidth to fetch the additional instructions. Otherwise, keep the core simple. This is why tagging isn't a big deal. Tags can easily be stripped/added/compared while keeping everything in registers.

I'd point out that there is no reason that Lisp can't run as fast as C, even without type declarations. Modern dynamic compilation can produce type-specific code at runtime that will be highly optimized for the actual types in use. Most function parameters are of the same type on each call and can therefore run at full speed. See Andreas Gal's work on trace compilers, for instance. With multicore, you could even dedicate a core to handling the compilation tasks in parallel, as code continues to run.

Now, all that said, one area where high level languages do burn a lot of cycles is in memory management. In fact, all languages tend to burn a lot of performance in memory management, but HLLs suffer more because they tend to put more in the heap and they are typically garbage collected. The issue of what goes in heap is a compiler problem and I won't say anything about it. GC, however, can be made to operate in parallel with special hardware. This does not necessarily mean full HW GC. Much of the actual GC algorithm can still be implemented on a simple software co-processor, but the special hardware allows you to intercept memory references issued by the main CPU to keep the memory model coherent (think read and/or write barriers that operate in parallel and simply stall the main CPU for a cycle or two when needed, until the memory state becomes coherent again). This eliminates comparisons and branches associated with every read/write (depending on when you implement the barrier). It also eliminates, or greatly reduces any pauses caused by "stop the world" GC requirements. Things like tracepoints can also be optimized. Finally, a lot of the performance-robbing GC bookkeeping can be reduced/eliminated (think card-marking and back-pointer checking).
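To make the cost concrete, here is a toy Python sketch of a card-marking write barrier, the kind of per-store bookkeeping the proposed hardware assist would take off the main CPU's critical path (card size and heap layout are made-up numbers):

```python
# Sketch of a software card-marking write barrier (illustrative sizes).
# Every pointer store into the heap pays this bookkeeping; hardware GC
# assist could do the same check in parallel with the store.

CARD_SIZE = 512                      # addresses covered by one card (assumption)
HEAP_SIZE = 1 << 16

heap = [None] * HEAP_SIZE            # toy heap, indexed by address
card_table = bytearray(HEAP_SIZE // CARD_SIZE)

def write_ref(addr, value):
    """Store a reference, marking the card so the GC can later find
    old-to-young pointers without scanning the whole old generation."""
    heap[addr] = value
    card_table[addr // CARD_SIZE] = 1   # the extra work on every store

write_ref(1000, "young_obj")
write_ref(1001, "another")
# The GC only scans marked cards instead of the entire heap:
dirty = [i for i, c in enumerate(card_table) if c]
print(dirty)  # → [1]
```

In a real VM the barrier is only a few inline instructions, but it runs on every reference store; that is precisely the overhead a parallel hardware barrier would hide.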

I'm still going to handwave here, but it seems like this area would definitely help HLL performance. In the same way that everybody implements the V-to-P translation of modern virtual memory systems in hardware using a TLB, this would speed up memory specifically for the requirements of HLLs.

Finally, I'll just point you an email conversation between Cliff Click at Azul and Dave Moon and Dan Weinreb, formerly at Symbolics. They discussed the hardware GC support of the Azul hardware Java processor. It definitely seemed like a win over stock RISC.

#30 Dave Roberts on 01.28.09 at 7:35 pm

Sorry, forgot the reference: http://blogs.azulsystems.com/cliff/2008/11/a-brief-conversation-with-david-moon.html

#31 Brendan on 01.28.09 at 8:42 pm

I understand that when reclaiming old memory Java wants to zero it.

And that current CPUs load the old contents from slow memory into cache just so it can be zeroed.

What about a way to not load current memory contents into cache and just write to it?

#32 pm on 01.29.09 at 12:52 am

I mean, the best high-level architecture is the Intel iAPX 432.
It is too complex, but it could be simplified.

#33 Ian on 01.29.09 at 2:48 am

To 'just write to memory' won't work. You have to ensure that all processor cores have a consistent view of that memory space first; this means synchronizing caches, among other things. If I just zeroed out the main memory, other processors could still be using data in their L2/L3 caches which they think is fresh, not realizing that another processor on the system has declared the area dead. Equally, they could then write their contents back over my newly zeroed area. So given that you have to pause and examine the contents of all other cores' caches, you may as well go ahead and just load the area into cache, write it, and then use the standard cache consistency algorithms to handle the writeback, rather than having to implement and debug a special case.

Personally, I'd just love to see people target the correct architectures when they compile stuff. Sure, i386 and i586 code will work on an i686, but it also prevents the use of any modern features at all. The move to 64-bit is less useful for the expanded address space than for doubling the register count (to 16, half of what a POWER core has) and for forcing people to target an up-to-date architecture.

If I had my way, the world of processors would be a lot more RISC than currently. I just wish I could afford a Power6 or SPARC system to play with. I'm also keen to see how the hardware scout feature of the SPARC Rock series processors turns out.

#34 Mate on 01.29.09 at 3:28 am

I think the problem is that you want to solve the wrong problem. As you said, high levels of indirection are costly, and you want to keep access to flat memory structures fast.

However, IMHO the image processing application is an extremely specific example. Not very many things are stored in flat structures these days. Even databases are flat only as long as you don't want massively parallel access, only want to read them, etc.

The point is: real world data is often (way more often than not) composed of complicated data structures full of indirection. Modern "killer applications" are not about sequential, predictable operations over a structured array; instead, they're about complex methods of updating these deep and indirect data structures - and indeed, this is the main and natural domain of these HLLs you mention.

We can keep specialized CPUs or architectures for sequential operations (in fact, we already have these; they're called GPUs); and we should forget about optimizing our main CPUs for a case of flat data.

#35 Jon Harrop on 01.29.09 at 1:07 pm

I want to create HLL implementations that will work efficiently on my customers' many cores in the years to come but I cannot buy a manycore system today so I cannot test my ideas. I want robustly-deployable binaries of immersive graphical applications but today's separation between the CPU and GPU requires graphics drivers that are a major weak point in the reliability of the whole system, massively increasing our support costs and even forcing us to abandon product lines.

My proposed solution: an affordable manycore CPU today.

Many cores executing an existing instruction set: can reuse existing OS and compiler technology like Linux and LLVM. Good programmable graphics capability from multithreaded software without an error-prone graphics driver.

On a separate note, concurrent garbage collection is becoming increasingly important and is already a major performance bottleneck. I have no idea how this might be done but perhaps dedicated hardware can help speed this up? Also, work stealing queues (e.g. Cilk, Microsoft Task Parallel Library) are an excellent general-purpose solution for efficient parallelism. Perhaps they could be built into each core in order to lower the overheads?

#36 CybFreak on 01.30.09 at 3:22 pm

I agree with the author: we don't need new hardware for high-level languages. What we really need is to start thinking more about efficiency. Good discipline and education could let programmers use low-level or "difficult" languages the way they now use high-level languages.

#37 Joe on 01.30.09 at 11:25 pm

"If you use Lisp in the Lispy way that makes it so attractive in the first place…"

Then you only need to optimize the tiny part of your program that's the *bottleneck*. It's a lot easier to rewrite a function or two from Lisp into (slightly different) Lisp, than it is to write it in C and interface it and link it in (at least, in any HLL I've ever used).

"“A large project in C would implement the Lisp run time?” Oh really? You mean each variable will have the type LispObject (or PyObject or whatever)?"

No, he means you end up with a half-dozen specialized ad-hoc MAPCAR implementations, and a half-dozen specialized ad-hoc FILTER implementations, and so on. Every large C project I've ever seen does this. To counter-challenge: I'd be curious to see one that didn't.

Your challenge is a good one, if you could constrain it a bit (is there a good simulator or something we should use?). But JWZ didn't say anything about changing machine architectures. I don't know why you think his quote is at all related to the others, or to your subject.

#38 Somejan on 01.31.09 at 7:13 am

As has been said in other comments: primitives that help GC on a RISC core could help. Also, allow cheap thread creation without going through the OS, so it would be feasible to run a sequence of a few hundred instructions on another core.

Performance of dynamic languages depends to a much greater extent on smart compilers, JITs, and dynamic optimizers that take into account the actual runtime values of the data the program is handling. To do that, the program has to be profiled, which on current hardware always has some performance impact (or is at least a tradeoff between profiling accuracy and performance). How about one core sending hardware-generated profiling data to a second core that does the analysis, so there is no runtime cost to profiling (as long as enough cores are available)? If this happened, profiling everything, including production use, could become the norm. This in itself does not speed up anything, but I think it would improve the opportunities for both compilers and developers a lot.

But a lot of the current performance challenges for HLLs probably need to be solved with software (jits etc) instead of hardware.

#39 Dave Roberts on 02.02.09 at 9:51 am

@Somejan, you say "Also, allow cheap thread creation without going through the OS, so it would be feasible to run a sequence of a few hundred instructions on another core."

I don't think this would help much. Thread creation is such an OS-specific, or even language-specific (in the case of Erlang), operation that I don't think you could implement it at the level of the CPU.

I'm a believer that GC *assist* is the right answer, but it's only assist. Different languages and operating systems are going to want to manage GC in a slightly different way. The best you can do is implement some of the more painful primitives and let them use those, as opposed to putting a full GC algorithm in hardware itself.

#40 Loper OS » Blog Archive » The Hardware Culture, or: What They Build, Works! Can We Say the Same? on 03.12.09 at 10:47 am

[...] who believe that a CPU ought to be designed specifically around the needs of high-level languages: Do you think high-level languages would run fast if the stock hardware weren’t "brain-damage… My challenge is this. If you think that you know how hardware and/or compilers should be designed [...]

#41 Peufeu on 09.28.09 at 2:25 am

If you have billions of dollars and the worst ISA ever designed, you can still make the fastest CPU (but you'll have to waste some watts and some silicon). See Intel versus Motorola PPC.

Anyway: I'm not going to take your challenge, because executing HLL in hardware is not only impossible, it is useless.

The future of HLL is JIT compilation and particularly specializing JITs. In theory this can be faster than C code.

Consider some code. Perhaps it's some vision processing code, or some big numeric analysis code, or some video encoder. In this case, what you want is big fat DSP power with big SIMD registers, big memory bandwidth, lots of MAC units, etc. And your code is going to be written in C if not handwritten assembly.

But most probably it won't be your code: you'll be using an FFT library from Matlab or LAPACK, or something. In any case it's not going to be written in an HLL; it'll be written by "bare metal" guys.

So, a HLL cpu makes no sense in this case.

Now consider the rest of the code on earth, like a database, a web server, your usual Python script, etc. Most likely there is no DSP power needed, and most likely this code is control-flow bound, that is to say the execution time is dominated by ifs, with their friends, mispredicted branches and non-prefetched memory accesses. HLL interpreters especially consist mostly of conditionals: resolve and dispatch the next bytecode, etc.

More specifically, let's look at PostgreSQL. It does verify Greenspun's Tenth Rule of Programming, since it contains explicit interpreters for a few languages (SQL, PL/pgSQL, Python, Perl, etc.) and implicit interpreters for a few ad-hoc languages: for instance, your SQL query, once parsed and planned, is turned into an internal representation, which is nothing more than a HLL program written in a language with specialized instructions like "read tuple", "add tuple to hash join", "compute expression", "test condition", etc.

Any random bit of code that contains lots of conditional statements is, in effect, an interpreter for an ad-hoc language.

So, back to specializing JITs.

I believe an essential attribute of a CPU is that it should make the job of the compiler easy, so that the compiler can generate good, fast code, if possible almost optimal code, very quickly. This is essential if the compiler has to run in real time, as in the case of Just-In-Time compilation. x86 is the complete opposite of this since the ISA is shit.

Which is why recent x86 CPUs all have hardware JIT compilers that JIT-compile x86 code into some secret internal representation. Talk about silicon waste.

You need an ISA with good uniformity (any register is just as good as any other for anything), and either lots of registers or a stack-based design (which rather simplifies register allocation, since there is none). You also need conditional instructions to handle short tests without jumps. You need branch hints from the compiler.

And you need a specializing JIT. What it does is interpret your HLL code while compiling it. However, compiled HLL code doesn't necessarily run faster than interpreted, since you can't make lots of assumptions; for instance, in Python, variables can be of any type, so whether you compile or interpret, you'll still have lots of tests and dispatch to know if a+b are strings or integers. You could use static type checking, but would you want to use it EVERYWHERE? You could end up with D (which is good) or C++ (urgh).

Here comes the specializing JIT. When it detects that a bit of code is running frequently, it will look at the types of the variables, then emit specialized code. For instance it will replace all "+" by floating point machine instructions, remove boxing/unboxing of all local variables, etc, then execute the generated machine code instead.
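A minimal Python sketch of that specialization step, with a one-entry-per-type cache standing in for real code generation (names and the dispatch structure are illustrative):

```python
# Minimal sketch of type specialization (a monomorphic-ish inline cache).
# A real specializing JIT emits machine code; here we just cache a
# pre-picked fast path per observed pair of argument types.

def generic_add(a, b):
    # what an interpreter does on every '+': dispatch on runtime types
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    if isinstance(a, str) and isinstance(b, str):
        return a + b
    raise TypeError

def specialize(fn):
    cache = {}
    def specialized(a, b):
        key = (type(a), type(b))
        if key not in cache:
            # "compile": pick the specialized body once for this type pair
            if key == (int, int):
                cache[key] = int.__add__      # no type checks left
            elif key == (str, str):
                cache[key] = str.__add__
            else:
                cache[key] = fn               # fall back to the generic path
        return cache[key](a, b)
    return specialized

add = specialize(generic_add)
print(add(2, 3))        # first call specializes for (int, int) → 5
print(add(2, 3))        # later calls skip all type dispatch → 5
print(add("a", "b"))    # → ab
```

The generated machine code of a real JIT replaces even the cache lookup with a cheap guard; the point here is only where the dispatch work disappears to.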

You could have used D and got rid of the interpreter, but that doesn't solve your next problem, which is that you are going to write an interpreter many times in your program (see postgresql case above). So you might as well use the same language your program runs in and have the JIT turn it into machine code.

The next step for the JIT is to look at the values of variables. For instance this bit of code is often called with foo = 42. So we can remove a big switch-case on foo, remove a few ifs, and all that's left of the function is 5 instructions, which we inline in the caller. Then we notice the caller calls the same function with foo=43 many times too, because the caller is some kind of ad-hoc interpreter like a SQL query executor, so we JIT it, and we propagate this up the stack as needed, and we get a custom-made super fast program.

Suppose you have to read data from a database. Data is stored on disk in a special format which depends on lots of parameters (how many columns, their types, etc). You want to compute average(a million values). The on-disk format parser is an interpreter for an ad-hoc language, and it's going to be called a million times with the same program: JIT it to 5 machine code instructions! The query executor which takes your data and forwards it to the function that calculates sum() is an interpreter for an ad-hoc language, and it's going to be called a million times with the same program: JIT it to 5 machine code instructions!

You get the idea.

If you want a revolutionary jump in performance on most workloads, you ain't gonna get it with a new CPU. For instance, it takes Postgres about 700 cycles to process one row if you want sum(a column), which is already amazingly optimized given the complexity of the stuff. It requires an extremely smart CPU to be able to predict all these conditionals, though. You could easily get a 10x speedup by JITting all the ad-hoc interpreters.

So a "HLL" CPU shouldn't be HLL at all; it should make the job of a specializing JIT easy, fast, and efficient: generate good machine code quickly. Maybe a few helper instructions are needed, like some optimized checks to see if the assumptions under which the JITted code was generated still hold, but that's about it.

Then you'll need a good JIT, the ability to hint it (for instance with type hints, or "value doesn't change" hints), and the ability to generate code, very quickly, using the same language you used to write your program, so you can get rid of all the ad-hoc interpreters. Or keep the ad-hoc interpreters and have the JIT simplify and customize them by removing all the conditionals, if you can hint it that a parameter is always a constant type or value. Or both. If you write a parser for a special binary format, you might as well generate code. It is very important that this integrates smoothly into your program: you must not have one main language and one JIT language. See Lisp.

.NET has Lightweight Code Generation. That's cool.

Next thing, parallel processing. Multicores are great for this, but we have the big problem of synchronization overhead. A CPU executes 2.5 billion cycles per second, yet it takes something like 1000 cycles to grab even a slightly contended lock. WTF?

Your CPU is going to need some hardware synchronization, like:

- Explicit lock and release of a memory range (i.e. cache lines) (who needs test-and-set?) with fast re-acquisition by whoever needs it next. Once a core releases the lock, put it back NOW in the L2 so the next one can grab it quickly; don't keep it in L1, duh!

- Transactional memory, to get rid of locks: most often, the action you need to do under the lock is simple, like reserving some buffer space and incrementing a few counters. Remove the lock and do it in a transaction. If transactions can only touch a few cache lines, it's not that complex. If another transaction updated said cache lines, retry. This needs software support.

- The ability for userland to block interrupts while in a critical section, so it doesn't get scheduled out while holding the lock. Since that is a usurpation of userland power, only grant the ability to block interrupts for, say, 100 cycles.

- Lock waits should take no resources, so the other thread which HyperThreads on the same core can use 100% of the resources.

- Fast atomic counters
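The transactional-memory bullet can be sketched in software like this (Python, with an explicit version number per "cache line" standing in for the hardware's conflict detection; all names are illustrative):

```python
# Optimistic transaction over a few "cache lines", sketched in software.
# Read the versions, compute, then commit only if nothing changed in the
# meantime; otherwise retry. Hardware TM does this on real cache lines.

class Line:
    def __init__(self, value):
        self.value = value
        self.version = 0        # bumped on every committed write

def transact(lines, update):
    """Apply `update` (list of old values -> list of new values)
    atomically and optimistically, without taking a lock."""
    while True:
        snapshot = [(l.value, l.version) for l in lines]
        new_values = update([v for v, _ in snapshot])
        # commit: succeed only if no line was touched since we read it
        if all(l.version == ver for l, (_, ver) in zip(lines, snapshot)):
            for l, v in zip(lines, new_values):
                l.value = v
                l.version += 1
            return
        # else: another transaction got there first -- retry

buf_space = Line(100)
counter = Line(0)
# "reserve some buffer space + increment a few counters", no lock taken:
transact([buf_space, counter], lambda vals: [vals[0] - 8, vals[1] + 1])
print(buf_space.value, counter.value)  # → 92 1
```

In hardware the snapshot/commit pair would be the transaction's read and write sets of cache lines, and the version check would be the coherency machinery noticing a conflicting write.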

#42 Yossi Kreinin on 09.28.09 at 10:29 am

"I’m not going to take your challenge, because executing HLL in hardware is not only impossible, it is useless"

Well, the "useless" part was, more or less, the point I was trying to make.

#43 Neko on 10.17.09 at 2:34 pm

"Do you think high-level languages would run fast if the stock hardware weren’t “brain-damaged”/”built to run C”/”a von Neumann machine (instead of some other wonderful thing)”? You do think so? I have a challenge for you. I bet you’ll be interested."

Yes, I think so. It's a degenerate case of the current state of the world: HLLs *do* already run fast. On hardware designed especially for them, they would presumably run no slower. :-)

#44 Dan Cook on 11.08.09 at 8:46 pm

I found this article because I am designing an object-oriented language for a simple processor that cuts out all the crap in the middle. For example, all variables alias to a static address (including pointers) because that is how the assembly programmers do it.

I just wanted to comment because (1) it thrills me to find someone with my same enthusiasm for this kind of "problem" (and frustration for lack of info on the topic) and (2) I was looking for advice, but if here is not the place for it, I will keep looking (any suggestions where?)

Anyway, here is my conundrum:

IDEALLY, Class objects (think C++/Java) containing n bytes of data each alias to a chunk of memory containing n bytes of data, and nothing more. However, my language (OPIA) will allow for single inheritance, so I need a way to manage late-binding functions. In other words, an Animal object may have a default walk(), but a given Animal might just HAPPEN to be a Snake (a subclass of Animal) and walk() differently — even though that context may never know it's a Snake.

So I have three options: have each (or just some) object also contain a pointer to a function table (or the function itself, if that is the only difference); or I can tell late-binding to bite me; or I can just remove inheritance from the language.

My thinking is to just keep the late binding and have it be used less automatically, leaving it to the programmer to make efficient code. So if I am going to have these pointers for late binding, those pretty much act like class-ID tags and allow me to provide an "instanceof" operator (like in Java), but there are issues with that too: (1) asking "B instanceof C" is not just a check of B being a C, but also of B being any SUBCLASS of C (this can be reduced by having all functions and/or function tables arranged so their addresses form a range, simplifying the check to a boundary check) — I could generate a keyword that JUST checks the one class too. (2) Either some class objects do NOT have any such pointers because there are no redefined functions in them (rendering the instanceof operator unusable on those), or all objects are forced to have a pointer to a table or function (even if it's a dummy) so that the instanceof operator always works. One thing to consider is that my language is VERY loose about type safety; a "void pointer" to an integer might be cast to some other object, and then the instanceof check has the potential to incorrectly test true (I REFUSE to put a type tag on primitive values).
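For what it's worth, the boundary-check trick can be made concrete by numbering classes in preorder, so that each class's subclasses occupy a contiguous ID range (a Python sketch with illustrative class names):

```python
# instanceof as a boundary check: number classes in preorder so that
# every class's subclasses occupy a contiguous ID range.

classes = {          # class -> list of direct subclasses (illustrative)
    "Animal": ["Snake", "Dog"],
    "Snake":  [],
    "Dog":    ["Puppy"],
    "Puppy":  [],
}

ids, ranges = {}, {}

def number(cls, next_id=[0]):
    my_id = next_id[0]
    ids[cls] = my_id
    next_id[0] += 1
    for sub in classes[cls]:
        number(sub)
    ranges[cls] = (my_id, next_id[0])   # [my_id, end) covers all subclasses

number("Animal")

def instance_of(obj_cls, cls):
    lo, hi = ranges[cls]
    return lo <= ids[obj_cls] < hi      # one compare pair, no table walk

print(instance_of("Puppy", "Animal"))  # → True
print(instance_of("Puppy", "Snake"))   # → False
```

The same numbering works if the "IDs" are the addresses of the function tables themselves, laid out in preorder; it only holds for single inheritance, which is what OPIA has.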

So I guess my question is, what is your opinion on the need for late binding and/or pulling strings just to make the things that come with it work? My language is to generate the most efficient and low-level code possible without forcing data and operations to carry extra bulk; but if I am going to provide that feature, that is the only way to do it; and if I do, is it right to be partial about it (allowing an "instanceof" to only work on instances of descendants of select classes)?

#45 Dan Cook on 11.08.09 at 8:51 pm

Please email me if you have anything (shkaboinka@gmail.com), or just click my name to see where I am hosting my project

#46 Yossi Kreinin on 11.08.09 at 10:20 pm

@Dan: Why not talk here in the comments? I presume it's because of the lack of notification upon reply. Oh well. I'll answer in an e-mail then.

#47 ken on 12.29.09 at 12:58 am

First, let me note that I work way higher up in the stack, so don't really know what I'm talking about.

My understanding is that performance of most modern software is gated on access to memory. It's the L1 and L2 access patterns that really matter.

Accordingly, modern hardware speculatively faults cache lines into the L2 and L1 caches based on detected access patterns.

…which, as far as I know, are all based around sequential data access with a stride. That isn't what object-oriented programs do. A typical object has fields, each of which is either a scalar or a pointer. The pointers get followed (sometimes).

How about speculatively faulting memory into the caches based on the assumption that people are pulling in blocks of memory containing pointers that they're going to dereference?

There are clearly some issues.

It seems undesirable, and probably incorrect, to take a page fault that otherwise wouldn't be taken. Okay, don't do that.

You may want advisory information passed down about whether a particular read _is_ likely to be pulling in an object, and hence should start pulling in its contents, or whether it's a read that might be part of sequential data access. Heck, maybe you even want to have layout maps of some sort available specifying at what offsets from the original read interesting pointers are likely to occur.

You probably cannot afford to speculatively pull in very much this way, so something has to give. The layout map is one possibility. Another might be to speculatively dereference only one or two words at the head of likely objects. That'd catch the pointer to the class object in ObjC, for example, which has to be dereferenced for any message sent to the object. Or, I don't know much of anything about how branch prediction hardware works. Perhaps dereferencing prediction could be done. "The last two times you read memory location A, you soon read *A and *(A + sizeof(void*)), so we're going to speculatively dereference them this time."
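That dereference-prediction idea can be modeled as a tiny correlation table; here is a Python toy (window size and confidence threshold are arbitrary assumptions):

```python
# Toy dereference predictor: if loads of address A have twice been
# followed shortly by loads of mem[A], start prefetching A's target.

mem = {10: 50, 50: 99}       # mem[10] holds a pointer to address 50

WINDOW = 3                   # "soon" = within the next 3 loads (assumption)
followed = {}                # addr -> times a load of addr was soon dereferenced
recent = []                  # the last few (addr, value) loads observed
prefetched = []              # addresses the predictor would pull into cache

def load(addr):
    value = mem[addr]
    # learn: did an earlier load produce this address (i.e. was it a pointer
    # we are now dereferencing)?
    for past_addr, past_value in recent:
        if past_value == addr:
            followed[past_addr] = followed.get(past_addr, 0) + 1
    recent.append((addr, value))
    del recent[:-WINDOW]
    # predict: loads of this address are usually dereferenced, so prefetch
    if followed.get(addr, 0) >= 2 and value in mem:
        prefetched.append(value)
    return value

# the pattern "load 10, then load *10" repeats; by the third time the
# predictor is confident and prefetches the pointed-to line:
load(10); load(50)
load(10); load(50)
load(10)
print(prefetched)  # → [50]
```

Real hardware would key such a table on PC and physical address and prefetch cache lines rather than single words, but the correlation it learns is the same.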

#48 Yossi Kreinin on 12.29.09 at 1:58 am

Extending speculative memory access to handle indirection is an interesting idea and might yield performance benefits. I don't think it can eliminate the overhead of indirection, though; that is, a flat layout will still work faster: a single burst access bringing a cache line will always be handled more efficiently than a bunch of (correctly guessed) burst accesses bringing several cache lines, both because of spending more bandwidth, memory controller queue slots, opening more DRAM pages and other such crud to spell out the requests, and because of transferring more data (you can't transfer just the words you need unless you don't intend to cache them, which on average you do want to do, since you access at least a couple of adjacent words).

The practical thing thus isn't eliminating the overhead of indirection but lowering it; the extent of this is hard for me to guess. I can testify that the current speculative access mechanisms aren't enough to get performance "transparently" out of a memory system; I actually prefer a DMA to an explicitly managed local memory (Cell-style) to optimizing cache access with cache prefetching hints, and caches won't work very well without prefetching hints in tight loops accessing arrays. However this doesn't mean that the more complicated mechanism you suggest would work worse than manual tuning; on one hand, it's, well, more complicated which means it's probably harder to make it work "right", on the other hand the manual tuning you can do in the case of indirection is gnarlier so the human competitor isn't in a good position, either. So it's very hard for me to guess how well it would work compared to the current performance.

#49 Erik H. on 06.17.10 at 4:09 pm

I'm nowhere near you guys in technical experience, but doesn't the Burroughs B-series CPU meet your requirements? http://en.wikipedia.org/wiki/Burroughs_large_systems#B5000

It was made to directly support HLLs (well, what was considered high-level at the time), and according to that article, it did not have an assembly language in the traditional sense of the word, as it ran Algol code natively. From what I've heard, Lisp ran quite fast on these CPUs (at least compared to other architectures at the time).

#50 Yossi Kreinin on 06.18.10 at 4:05 am

I mention the B5000 in the text; the question isn't whether a "high-level CPU" is possible, but whether it's a gain in terms of performance/price. Because if it isn't, our current situation, where you implement HLL VMs in software, is just fine (except that unsafe, lower-level languages are also available, thus some software is written in them, and is itself unsafe and/or not to the liking of those liking HLLs). If it is, then someone could presumably have made a fortune selling more efficient systems, with higher-level hardware and simpler compilers/VMs, running HLLs and giving you all of their benefits for a lower price than that of a conventional hw/sw stack. (That this doesn't happen can be explained away by network effects/consumer stupidity/etc.; I'm looking for a technical proof, or at least a rough outline, showing that such a price/performance gain is possible at all, though.)

#51 Ingo on 07.05.10 at 4:39 pm

Just because nobody mentioned it:

http://www.jopdesign.com/

JAVA optimized processor as GPL code :-)

I found this quite interesting, since the guy behind it (Martin Schoeberl) has some really interesting ideas on how to get better "Worst Case Execution Time" boundaries in the presence of a garbage collector, by employing a hardware unit to copy objects (for garbage collection purposes).
See here:
http://www.jopdesign.com/doc/nbgc.pdf

And he actually implemented all that in a cheap Altera Cyclone FPGA.
I am really itching to try this out :-)

so long
Ingo

#52 Yossi Kreinin on 07.05.10 at 11:38 pm

gcj seems to beat that one on the author's benchmark, and it doesn't seem to have been compared to modern JIT VMs. It seems that the main potential benefit is the ability to run JVM bytecode in a very small memory footprint - something you can do with neither a compiler generating native code nor a JIT VM; but it's a niche goal, and somewhat tangential to the question of a language's level of abstraction and its implied performance penalty.

#53 David Barbour on 09.09.10 at 7:09 pm

The ability to 'watch' a region of memory and get a signal when it is updated would be very nice for reactive programming models and efficient concurrent garbage-collection.

I'd love to have improved atomic ops (hardware transactional memory, or even just double compare and swap). This would allow more efficient parallelism.

#54 Yossi Kreinin on 09.10.10 at 12:13 am

Watching memory regions - the question is how many. If it's a couple or a dozen, it's dirt cheap and on some targets available today with hardware data breakpoints and/or page table tweaks. If you need to watch an arbitrary number of regions of arbitrary size, then the only way I see that could work is by having a shadow memory keeping a region ID for each word. If the ID is, say, 24b, and the word size, say, 16 bytes, the cost would be <25% memory overhead. The question is what sort of performance improvement you'd expect.
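The arithmetic behind that bound, as a sketch (the constants match the numbers above; `shadow_overhead` and `shadow_index` are hypothetical helpers, not any real interface):

```c
#include <stdint.h>

#define WORD_BYTES 16   /* granularity of watching: one ID per 16-byte word */
#define ID_BITS    24   /* region ID width */

/* Shadow-array size as a fraction of watched memory:
   24 bits of ID per 16*8 = 128 bits of data = 18.75%, under 25%. */
double shadow_overhead(void) {
    return (double)ID_BITS / (WORD_BYTES * 8);
}

/* Which shadow-array slot holds the region ID covering an address. */
uintptr_t shadow_index(uintptr_t addr) {
    return addr / WORD_BYTES;
}
```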

As to atomic ops - I'm too used to coarse-grained conservative parallelism suitable to current targets to be able to easily contemplate the benefits of what you propose.

#55 Johan Gall on 07.10.11 at 4:05 pm

Hello,

I don't think the debate is still ongoing and I am quite the newbie junior fresh-out-of-school programmer but I would like to put down just a few lines about my ideas.

Nowadays I think the big problems are "memory latency", "energy efficiency", and "parallelism & concurrency".
And the big trends are toward lockless programming, STM, actors, and functional programming.

For the base of the core (RISC or x86, such things) I don't have enough experience to give a sensible opinion but here is my take on what should be tweaked on such a base.

User-level VMM (it was addressed in a previous comment) with an "aliasing TLB"
-> detailed idea here http://codingrelic.geekhold.com/2008/05/high-level-cpu-response.html
The author speaks about functional programming gains on function calls, but I think it would massively help GC performance (GCs are ubiquitous these days).
I think the Azul JVM delves into these things (by playing at the OS level). http://en.wikipedia.org/wiki/Azul_Systems (papers linked on the site)
Control on this supplementary indirection is something you see in countless concurrent GC papers.

User-addressable cache memory with a user-addressable sync instruction. I think we try too hard to hide the hardware, and end up having to game it to avoid cache-wise pathological behaviour… However, such things would probably end up significantly increasing binary size, so I think there should be 2 types of cache memory: the current invisible one for "best effort", and an explicitly addressable one.
I also think that control over what goes into the cache would also help to minimize context-switching cost under heavy concurrency.

Some kind of hardware-level STM for inter-cache synchronization (and for main memory), but I don't know if it's implementable or if it has actually been done already. That one is heavily debatable…

That's all.

I think too much abstraction of low-level things actually doesn't help HLL compilers and optimizing JITs produce efficient code. A few things should be exposed, at least experimentally. If it proves successful, the old abstraction could actually be removed, which would perhaps simplify the chip too (I don't really know…)! (This is a bit like noting that RPC did not save us from the hardships of distributed computing.)

Best regards,
johan.gall+yosefk >>= google mail service

#56 Alaric Snell-Pym on 09.14.11 at 5:37 am

Oh, dear! Premature optimisation a-plenty in this thread :-)

Here's a plan: Take code compiled from the best HLL compilers, and then run it, profile it with callgrind or gprof or similar, and decide how much time it's spending doing something that might be done faster in hardware. THEN we have a basis to work from.

From memory, dynamic type dispatching is a bit of a killer; when done with "small" values, that boils down to branching between integer, FP, or call-a-handler to implement arithmetic on word-sized quantities (and yes, you need untagged arithmetic instructions available for when the compiler can deduce a type, and for dealing with supplied-untagged data such as 8-bit bytes). When done with "large" values (e.g., objects), this turns into multiple pointer indirections, which is just down to memory bandwidth, and better CPUs won't really help with that.

Things like hardware-level STM are handy for speeding up higher-level inter-process communications, but I think they're not quite what Yosef is after here - they'll speed up IPC written in assembly just as much as IPC written in Lisp; higher-level domain-specific abstractions are orthogonal to higher-level *languages*.

#57 Yossi Kreinin on 09.27.11 at 11:21 pm

Yeah, the memory problem with "large" objects is one reason why RISC is a good-enough HLL target. As to tagging and tag-based dispatching - I guess it really would be an "actual" increase in efficiency (not just moving costs between hardware and software, ending up with the same performance per dollar) if you use a small number of tag bits ("integer, FP, or call-a-handler"), but it only seems to deal with arithmetic operations (at the cost of data bits), and polymorphic arithmetic doesn't sound like a very large part of the productivity gain one expects from an HLL.
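For reference, the "integer, FP, or call-a-handler" dispatch looks roughly like this in software. This is a common low-tag scheme, not any particular VM's; the tag assignments are illustrative. A "high-level" ALU would do the tag test in parallel with the untagged add instead of as an explicit branch.

```c
#include <stdint.h>

/* Low 2 bits of a word carry the type tag (illustrative assignments). */
enum { TAG_FIXNUM = 0, TAG_FLOAT = 1, TAG_OBJECT = 2, TAG_MASK = 3 };

typedef intptr_t value;

/* Fixnums carry their payload in the upper bits; tag bits stay zero. */
static value make_fixnum(intptr_t n) { return (value)(n << 2) | TAG_FIXNUM; }
static intptr_t fixnum_val(value v)  { return v >> 2; }

value generic_add(value a, value b) {
    if (((a | b) & TAG_MASK) == TAG_FIXNUM)
        return a + b;       /* fast path: tags are 0, so a plain add works */
    /* slow path: float unboxing or a call-a-handler dispatch would go
       here; reduced to a placeholder in this sketch */
    return make_fixnum(fixnum_val(a) + fixnum_val(b));
}
```

Note the cost structure: every generic operation pays the tag test and loses 2 data bits per word, which is why the compiler wants untagged instructions whenever it can deduce the type.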

#58 4kCheshire on 11.16.11 at 2:13 am

Seems like I mostly agree with 5-Justin.Wick…
90% of problems now are not in hardware performance, but in modern OSes -
complicated structure, preemptive multitasking (a badly working "solution"
to dynamic system extensibility), unsafe-by-design multithreading (see
next sentence), bloated overdesigned libraries, layers on layers…
…note… Just like pointers, threads are an oversimplified building brick, and so
not safe. (pointers+GC) is safe and is a brick. Meanwhile, inheritance is
also unsafe - it's like a pointer, only on a larger time scale; that is why
OOP sucks in the large)) Because there's no architecture-GC in the brain))
Modular, AD - was good, OOP - not.
And threads. They are also a ONE-WAY dependence, and so unsafe in the back direction.
See? Sure it has problems… many programming problems are the SAME.
Uncontrolled dependencies… end-note…
"Constantly working re-simplification of the whole system" may be the way to go))
We need to implement not just "more processing power" (special-purpose
coprocessors are the way for this, and they already exist) but maybe some
essential language-runtime or OS divide-and-conquer black magic to rule all
this power… working on simple, compiler-friendly hardware.
There were many projects for "more high-level" hardware…
Running smlnj on bare hardware seems to have been done already at some point…
Lisp? Not good, because 1) "only lists" is not good (memory-bottleneck-wise);
automatic indirection is bullshit.
2) strong static typing is a MUST; it really simplifies reasoning about code.
Yes, 12-Paul, some assist for garbage collection from the hardware would be nice.
This is exactly the thing I want to say. Not a GC coprocessor with its own memory
channel)) No, what would be good… is more like mlkit region-based memory
management, only segment-table-like. Think of it as a more sophisticated
MMU unit. Memory protection and collection assist. Sit and watch addresses.
Transactional memory?
Haskell is not good (unpredictable), ML maybe good.
22-DGentry, interesting)
36-CybFreak, words like "good discipline and education could make
programmers use low level"… pfu… that just doesn't work, and will not))
Errors are an INHERENT property of ANY human brain, no matter how professional
that 1.4kg bunch of neurons is)) A real professional is just ready for his
own errors… We say - "big people make big errors", yeah?))

…anyway - who controls the memory controls the multicore world))
And remember, it's all there! We don't need one more powerful hardware feature))
We need a way to cut off these cool but unneeded, brain-eating, unsafe features!))
so they can't stand in our way to solving the problem…
too many unneeded abstractions… no more… please… hhhh…
(silently dying)
))

#59 Matt on 08.05.12 at 8:31 am

Yes! I'm glad that someone is saying this stuff.

#60 Bike on 08.07.12 at 8:22 pm

I realize this article is four years old, and you've probably seen it already, but: Have you read "Lambda, the ultimate opcode"? It's about a machine that uses tree execution instead of vector execution (i.e. Lisp-like evaluation instead of traditional assembly), and they built a prototype.
http://repository.readscheme.org/ftp/papers/ai-lab-pubs/AIM-514.ps.gz or http://repository.readscheme.org/ftp/papers/ai-lab-pubs/AIM-514.pdf

#61 Yossi Kreinin on 08.07.12 at 10:16 pm

I don't claim it can't be done, only that it's pointless. To quote Ken Thompson (from memory) - "Lisp is not a special enough language to warrant a hardware implementation. The PDP-10 is a great Lisp machine. The PDP-11 is a great Lisp machine. I told these guys, you're crazy."

#62 Bike on 08.11.12 at 11:32 am

I was just bringing it up as an example of someone actually making an HLL architecture instead of just complaining about x86. I don't know enough about hardware to know if it's actually helpful (though I doubt it, because I'm not using one).

#63 Matthew H on 09.05.12 at 1:02 pm

So current technology was allowed 30 years to get where it is today, and you want someone to go off and develop a viable processor, in isolation, and yet be within 25% of a modern system?

I understand your complaint - I'm in the same boat - but challenging someone else to come up with a solution is not going to produce anything. Especially since anyone capable of producing a design that could be implemented would also be capable of implementing it themselves; i.e., they don't need you to produce your interpretation of their design for them.

Your challenge is like those weenies who post in the game forums: "I have the best idea for the next killer FPS, I just need people to make it for me and you will get a cut once the game is making millions…"

I'm sick of weenies asking for others to make things for them, with the only return for blood, sweat, and tears being some thanks and recognition. Why not put up a few million bucks instead?

To me, "current technology", the "Internet", and all the protocols that go with them are humanity's first pass at potentially solving a problem. Now the collective "we" needs to toss out all the current methods (CPU designs, programming languages (all of them), protocols, etc.) and solve the problem properly. You will not build a better mousetrap if you always start with the same initial design. You have to get away from the influence, step back, and actually think differently.

I'm getting older and I like to rant and use analogies, so I'll use one here. I propose that if you tell 10 people to design a new operating system, you will get 5 new versions of Windows and 5 new versions of OS X. Nothing actually new, just more of the same. Projects like WINE and ReactOS prove my theory.

Same with the CPU or computer systems. Designers don't know how to make anything new, and any research that is going on is focused on somehow making CPUs even faster or cramming more cores onto a die, and further widening the already gaping hole between the CPU speed and the rest of the system.

In all seriousness, I actually happen to have a few new ideas of my own directly related to what you are talking about, and I'm starting to kick them around in hardware, but I have yet to find anyone to discuss them with. Maybe I'll eventually piss someone off enough to spark a good conversation. ;-)

#64 Yossi Kreinin on 09.06.12 at 9:20 am

Well, if you want to discuss your ideas with me, either publicly or privately, you're very welcome. As to "new things" - there's always the question of whether the new things are like the old ones because of inertia or for the same reasons that new wheels are as round as old ones. I believe that CPUs are more like wheels in this regard, and here I tried to explain why.

#65 Mohit on 09.20.13 at 8:46 pm

I'm interested in working on LISP processors, and I do believe that they can process data (I'm thinking statistical data analysis here - that's my domain) quicker than conventional CPUs - I'm still learning, though.

As for the challenge, I present this: http://www.researchgate.net/publication/3501984_MARS-a_RISC-based_architecture_for_LISP

#66 Yossi Kreinin on 09.21.13 at 6:10 am

You mean a Lisp Machine paper from 1989? "the tracing of a list with cdr and car"? I'm guessing it wouldn't concern itself with cache prefetching to the extent that's required to make linked lists work reasonably well in 2013.
