Do you love (”very”) high-level languages? Like Lisp, Smalltalk, Python, Ruby? Or maybe Haskell, ML? I love high-level languages.
Do you think high-level languages would run fast if the stock hardware weren’t “brain-damaged”/”built to run C”/”a von Neumann machine (instead of some other wonderful thing)”? You do think so? I have a challenge for you. I bet you’ll be interested.
Background:
- I work on the definition of custom instruction set processors (just finished one).
- It’s fairly high-end stuff (MHz/transistor count in the hundreds of millions).
- I also work on the related programming languages (compilers, etc.).
- Whenever application programmers have to deal with low-level issues of the machine I’m (partly) responsible for, I feel genuine shame. They should be doing their job; the machine details are my job. Feels like failure (even if “the state of the art” isn’t any better).
- …But, I’m also obsessed with performance. Because the apps which run on top of my stuff are ever-hungry, number-crunching real time monsters. Online computer vision. Loads of fun, and loads of processing that would make a “classic” DSP hacker’s eyeballs pop out of his skull.
My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don’t you actually tell us about it, instead of briefly mentioning it? Requirement: your architecture should allow to run HLL code much faster than a compiler emitting something like RISC instructions, without significant physical size penalties. In other words, if I have so many square millimeters of silicon, and I pad it with your cores instead of, say, MIPS cores, I’ll be able to implement my apps in a much more high-level fashion without losing much performance (25% sounds like a reasonable upper bound). Bonus points for intrinsic support for vectorized low-precision computations.
If your architecture meets these requirements, I’ll consider a physical implementation very seriously (because we could use that kind of thing), and if it works out, you’ll get a chip so you can show people what your ideas look like. I can’t promise anything, because, as usual, there are more forces at play than the theoretical technical virtue of an idea. I can only promise to publicly announce that your idea was awesome and I’d love to implement it; not much, but it’s the best I can deliver.
If you can’t think of anything, then your consistent assertions about “stupid hardware” are a stupid bluff. Do us a favor and shut up. WARNING: I can’t do hardware myself, but there are lots of brilliant hardware hackers around me, and I’ve seen how actual chips are made and what your constraints are. Don’t bullshit me, buddy.
Seriously, I’m sick and tired of HLL weenie trash talk. Especially when it comes from apparently credible and exceedingly competent people.
Alan Kay, the inventor of Smalltalk: “Just as an aside, to give you an interesting benchmark—on roughly the same system, roughly optimized the same way, a benchmark from 1979 at Xerox PARC runs only 50 times faster today. Moore’s law has given us somewhere between 40,000 and 60,000 times improvement in that time. So there’s approximately a factor of 1,000 in efficiency that has been lost by bad CPU architectures.” … “We’re not going to worry about whether we can compile it into a von Neumann computer or not, and we will make the microcode do whatever we need to get around these inefficiencies because a lot of the inefficiencies are just putting stuff on obsolete hardware architectures.”
Jamie Zawinski, an author of Mozilla: “In a large application, a good garbage collector is more efficient than malloc/free.” … “Don’t blame the concept of GC just because you’ve never seen a good GC that interfaces well with your favorite language.” Elsewhere: “it’s a misconception that lisp is, by its nature, slow, or even slower than C” … “if you’re doing a *big* project in C or C++, well, you’re going to end up reinventing most of the lisp runtime anyway”
Steve Yegge, a great tech blogger: “The von Neumann machine is a convenient, cost-effective, 1950s realization of a Turing Machine, which is a famous abstract model for performing computations.” … “There are various other kinds of computers, such as convenient realizations of neural networks or cellular automata, but they’re nowhere as popular either, at least not yet”. And… “The Von Neumann architecture is not the only one out there, nor is it going to last much longer (in the grand 400-year scheme of things.)”
Wow. Sounds dazzling and mind-opening, doesn’t it? Except there isn’t any technical detail whatsoever. I mean, it’s important to be open-minded and stuff. It really is. The fact that something doesn’t seem “practical” doesn’t mean you shouldn’t think or talk about it. But if something isn’t even a something, just a vague idea about Awesome Coolness, it poisons the readers’ minds, people. It’s like talking about Spirituality of the kind that lets you jump over cliffs at your mighty will or something (I’m not that good at New Age, but I think they have things like these in stock). This can only lead to three results:
- Your reader ignores you.
- Your reader sits on a couch and waits to gain enough Spirituality to jump around cliffs. Congratulations! Your writing has got you one fat fanboy.
- Your reader assumes he’s Spiritual enough already and jumps off a cliff, so you’ve got a slim fanboy corpse.
It’s the same with this Great High-Level Hardware talk. I can ignore it, or I can wait forever until it emerges, or I can miserably fail trying to do it myself. Seriously, let’s look at these claims a little closer.
Alan Kay mentions a benchmark showing how lame our CPUs are. I’d really like to see that benchmark. Because I’ve checked out the B5000 which he praised in that article. And I don’t think a modern implementation of that architecture would beat a modern CPU in terms of raw efficiency. You see, RISC happened for a reason. Very roughly, it’s like this:
- You can access memories at single cycle throughput.
- You can process operands in registers at single cycle throughput.
- And that’s pretty much what you can do.
Suppose you want to support strings and have a string comparison instruction. You might think that “it’s done in the hardware”, so it’s blindingly fast. It isn’t, because the hardware still has to access memory, one word per cycle. A superscalar/VLIW assembly loop would run just as quickly; the only thing you’d save is a few bytes for instruction encoding. On the other hand, your string comparison thingie has got you into several sorts of trouble:
- Your machine is larger, with little gain - you don’t compare strings most of the time.
- Your machine is complicated, so optimizing the hardware is trickier.
- Compilers have trouble actually utilizing your instructions.
- Especially as the underlying hardware implementation grows more complicated and the performance of assembly code gets harder to model.
When people were supposed to write assembly programs, the inclusion of complicated high-level instructions was somewhat natural. When it became clear that compilers write most of the programs (because compilation became cheap enough), processors became less high-level; the points above hopefully explain why.
And don’t get me started about the tagging of data words. B5000 had polymorphic microcode - it would load two words, look at their type bits and add them according to the run time types. Well, B5000 didn’t support things like unsigned 8-bit integers, which happen to be something I need, because that’s how you store images, for example. Am I supposed to carry tag bits in each frigging byte? Let me point out that it has its cost. And I don’t think this sort of low-level polymorphism dwarfs the cost of Lisp or Smalltalk-style dynamic binding, either (B5000 was designed to run Algol; what would you do to run Smalltalk?)
There’s another angle to it: Alan Kay mentions that you almost couldn’t crash the B5000, which suited the business apps it was supposed to run quite well. I think that’s just awesome, I really do (I shoveled through lots of core dumps). In fact, I think people who implemented modern desktop operating systems and web browsers in unsafe languages on top of unsafe hardware are directly responsible for the vast majority of actual security problems out there. But (1) in many systems, the performance is really really important and (2) I think that security in software, the way it’s done in JVM or .NET, still has lower overall cost than tagging every byte (I’m less sure about part 2 because I don’t really know the guts of those VMs). Anyway, I think that hardware-enforced safety is costly, and you ought to acknowledge it (or really show why this fairly intuitive assumption is wrong, that is, delve into the details).
JWZ’s Lisp-can-be-efficient-on-stock-hardware claim isn’t much better than Smalltalk-can-be-efficient-on-custom-hardware, I find. Just how can it be? If you use Lisp’s static annotation system, your code becomes uglier than Java, and much less safe (I don’t think Lisp does static checking of parameter types, it just goes ahead and passes you an object and lets you think it’s an integer). If you use Lisp in the Lispy way that makes it so attractive in the first place, how on Earth can you optimize out the dynamic type checks and binding? You’d have to solve undecidable problems to make sense of the data flow. “A large project in C would implement the Lisp run time?” Oh really? You mean each variable will have the type LispObject (or PyObject or whatever)? Never happens, unless the C code is written by a deeply disturbed Lisp weenie (gcc and especially BetaPlayer, I’m talking about you). The fact that some people write C code as if they were a Lisp back-end is their personal problem, nothing more, nothing less.
The dynamic memory allocation business is no picnic, either. I won’t argue that garbage collection is significantly less efficient than manual malloc/free calls, because I’m not so sure about it. What I will argue is that a good Lisp program will use much more dynamic allocation and indirection levels than a good C program (again, I ignore the case of emulating C in Lisp, or Lisp in C, because I think it’s a waste of time anyway). And if you want to make your objects flat, I think you need a static type system, so you won’t be much higher-level than Java in terms of dynamic flexibility. And levels of indirection are extremely costly because every data-dependent memory access is awfully likely to introduce pipeline stalls.
Pure functional languages with static typing have their own problem - they lack side effects and make lots of copies at the interface level; eliminating those copies is left as an exercise to the compiler writer. I’ve never worked through a significant array of such exercises, so I won’t argue about the problems of that. I’ll just mention that static typing (irregardless of the type inference technique) characterizes lower-level languages, because now I have to think about types, just the way imperative programming is lower-level than functional programming, because now I have to think about the order of side effects. You can tell me that I don’t know what “high-level” means; I won’t care.
Now, the von Neumann machine business. Do you realize the extent to which memory arrays are optimized and standardized today? It’s nowhere near what happens with CPUs. There are lots of CPU families running lots of different instruction sets. All memories just load and store. Both static RAM (the expensive and fast kind) and dynamic RAM (the cheap and slower kind) are optimized to death, from raw performance to factory testing needed to detect manufacturing defects. You don’t think about memories when you design hardware, just the way you don’t think about the kind of floating point you want to use in your numeric app - you go for IEEE because so much intellectual effort was invested in it on all levels to make it work well.
But let’s go with the New Age flow of “von Neumann machine is a relic from the fifties”. What kinds of other architectures are there, and how do you program them, may I ask? “C is for von Neumann machines”. Well, so is Java and so is Lisp; all have contiguous arrays. Linked lists and dictionaries aren’t designed for any other kind of machine, either; in fact lots of standard big O complexity analysis assumes a von Neumann machine - O(1) random access.
And suppose you’re willing to drop standard memories and standard programming languages and standard complexity analysis. I don’t think you’re a crackpot, I really don’t; I think you’re bluffing, most probably, but you could be a brilliant and creative individual. I sincerely don’t think that anything practiced by millions can automatically be considered “true” or “right”; I was born in the Soviet Union, so I know all about it. Anyway, I want to hear your ideas. I have images. I must process those images and find stuff in them. I need to write a program and control its behavior. You know, the usual edit-run-debug-swear cycle. What model do you propose to use? Don’t just say “neural nets”. Let’s hear some details about hardware architecture.
I really want to know. I assume that an opinion held by quite some celebrities is shared by lots and lots of people out there. Many of you are competent programmers, some stronger than myself. Tell me why I’m wrong. I’ll send you a sample chip. I’ll publicly admit I was a smug misinformed dumbass. Whatever you like. I want to close this part of “efficient/high-level isn’t a trade-off” nonsense, so that I can go back to my scheduled ranting about the other part. You know, when I poke fun at C++ programmers who think STL is “high-level” (ha!). But until this “Lisp is efficient” (ha!) issue lingers, I just can’t go on ranting with clear conscience. Unbalanced ranting is evil, don’t you think?
40 comments ↓
How about trying to implement Parrot’s VM?
http://www.parrotcode.org/
Exactly: /how/ should I go about implementing Parrot’s VM? :) And why would it be more efficient than running it in software on a standard RISC core hooked to standard RAM?
I’m not saying there are no good VMs for HLLs, just that implementing them in hardware wouldn’t increase the overall system efficiency (performance/price).
How about adding processor-level instructions for regular expression matching, or for higher-level operations used in regular expression matching? I’m not a processor guy, so I don’t know if that’s just a pipe dream, but wouldn’t it be faster to run almost any text-processing language (think perl, python, ruby) if fewer instructions are required to perform the comparisons? This might be impossible, but then again … maybe it’s just really, really difficult.
What about FPGAs? We run a soft core on a cheap FPGA at low MHz but get lots of acceleration from “coprocessor” cores that were compiled from C code. Instead of developing custom chips for one specific HLL maybe we’d like some FPGA coprocessors that can be loaded up with task-specific logic as needed.
You mention image processing - I remember using image processing boards in the late 80s that had FPGA front ends because the PCs were just too slow. Maybe it’s time to resurrect that approach.
First of all, if one’s going to try to implement a virtual machine (seems that would defeat the purpose of it being a VM, but hey whatever works), something like JVM or LLVM would likely be much more productive. Of course you’re left with the problem that they are essentially just slightly higher level than a normal CPU, nothing drastic. And Yossi is right, the trick is - how would you even do this?
My question is this - there was once a time when memory hierarchies were shallow enough that many if not most programs were CPU bound. Except in tightly pipelined systems, cryptography, and a few cases where datasets are so small that they fit into on-chip cache, this no longer seems to be the case. For that reason, I cannot possibly understand what performance benefit we’d see from implementing a higher level instruction set - unless it prevents us from accessing memory in a cache-incoherent manner. Anyone have thoughts on this?
I think Yossi would agree that it might be nice to have a standard CPU architecture that was easier to generate efficient code for (x86 is a bitch, and it’s intended successor EPIC was so much harder that the scientists I worked with had to code entire chunks of program in assembly, or suffer 300-400% performance hits easily). Indeed, if anything modern compiling has taught us it’s that reducing code size (and therefore instruction cache misses) is a big deal. Perhaps an architecture that had operands which map to higher level (but still fundamentally “C-like”) constructs, and then compiled them down to internal RISC opcodes might burn some silicon to save IC misses, as a form of code compression.
We already live in an age where (for some applications) dynamically compiled languages like Java can be *faster* than statically compiled C (if you don’t code like a slob, and have twice the RAM you’d need in C) due to runtime analysis, hotspot optimization, etc. CPU cores idling, just waiting for something to do while they fetch from RAM. If anything, what we need is not a better CPU, we need libraries, languages, and compilers that support dynamic program transformations to encourage cache-coherency. Get the data accesses in your LISP or Java program clustered appropriately, and you’ll see significant gains. Move away from the “arrays are contiguous blocks of RAM I index directly” paradigm of C, into something more along the lines of “arrays are fast O(1) containers for things I index via integer keys, which may be transformed at run-time depending on the optimized internal structure.”
http://www-03.ibm.com/systems/z/zaap/
(Specialty hardware for running Java?)
Does that count?
I have no idea what the zaap does - but that’s almost what it appears to be..
I’d agree that high level languages certainly have their cost on von Neuman machines, but you’re obviously failing Yegge’s test for common sense. You’re forgetting that we can make computers out of things other than switches. There have been general purpose analog neural networks like the Intel ETANN and Bellcore CLNN32/64. There have been digitally implemented neural networks like the IBM ZISC036. In fact I believe the ZISC036 meets all the requirements of your challenge. (See: http://www.it.hiof.no/prosjekter/hoit/html/nr1_96/zisc036.html)
If you are arguing that the von Neumann machine is likely the best computer you can build based on switches and current knowledge, then I’d agree.
Efficiency is a wide field. Efficiency in what dimension, for what problem? There have been situation where DNA and enzyme based computers are able to solve problems 100,000 times faster than it would have taken a state of the art von Neumann machine.
Transistors, as you point out, have had a lot of effort put into their optimization, but I think that Yegge’s whole points was that there exists technologies that when optimized will far exceed the capabilities even the best swtich.
Great article, I’ll get some beer and wait …
Okay, I’ll bite.
Here’s my submission: http://www.davidbrunton.com/2008/01/submission-to-high-level-cpu-challenge.html
Now I’ll sit back, also with a beer, and wait for my free chip ;)
As someone who has written a high level language, the major expense I’ve seen is for hash table lookups (for dynamic binding) which take many cycles. I’ve been told that associative memory hardware could reduce these lookups to a single cycle - is this true?
Interesting post. What about Lisp machines? http://en.wikipedia.org/wiki/Lisp_machine
As I understand it they were special purpose hardware to run Lisp programs faster than the general purpose hardware of the day.
Since I like the idea of side-effect-free HLLs, I’d like to see hardware support for reference-counting, because side-effect-free languages cannot create cyclic data structures. As far as I can tell, very little research has gone into making reference-counting faster (probably because people have always assumed that we would need to collect circular structures), but it seems like an ideal problem for hardware to solve. Assuming that the first field of an object is its reference count, you’d have instructions like “Copy and increment referent” and “Decrement referent and jump if zero”. Much of the slowness of side-effect-free work could also be eliminated by not copying objects if their reference count is about to become zero (a common case when simulating side-effects — the optimization turns the simulated side-effect into a real one).
As I understand it, one of the problems with reference counting is that it writes to memory a whole lot. An associative cache for reference counts could alleviate that. Other reference counting tricks, like one-bit reference counting or weighted reference counts, could become much faster with hardware support.
While I’m dreaming, it’d also be nice to see better tools for lightweight parallelism. All the silicon that’s spent on deep pipelines and sophisticated branch prediction and register scoreboarding could go to increasing the number of cores, provided compilers took advantage of them. I have to admit that I don’t have a clear idea of what the instructions for ultra-lightweight parallelism would look like, but the potential usefulness is limited only by the amount of time overhead needed to ascertain the number of free cores and start them working on a slice of the data. The compiler can ensure that they don’t need to touch the stack for the duration of the data-level parallelism.
Specialized cores that can only add, subtract, and multiply integers would be exceptionally useful, because lots of work just doesn’t involve floating-point math. (http://bloggablea.wordpress.com/2007/04/27/so-does-anyone-even-use-all-these-darn-cpu-instructions/)
Alternately, imagine a set of cores, each with a dedicated bus connected to dedicated memory (in addition to main memory). The memory could be addressed such that the cores are interleaved at the word level. One could drop an array into that area of memory and have all the cores operate on it in parallel. I feel like I’m describing something that already exists, though.
I feel that creating hybrid processors, based on von Neumann machines, but with special optimizations for Haskell-like languages, will produce the fastest machines for HLLs of the future. Our universe is very skewed towards the von Neumann model, and sometimes our problems map to it nicely, but sometimes they are much more like lambda calculus.
dbrunton and stevedekorte beat me :(
(oh, and szetlan is saying the same thing steve is maybe without realizing it…)
ideally, the answer is cellular automata, i read someplace that the reason von neumann used unified memory was to facilitate code modification in automata… i wonder what he’d think about modern computer viruses…
i think “associative memory hardware” gets you more than half way to solving the HLL challenge and is a lot more practical than a cellular automata hardware implementation (as i understand it, current chip fabrication and wiring make 3D matrices pretty cost ineffective.)
if one is going to go through the trouble of a hardware neural network implementation, (which doesn’t actually solve the challenge at all,) they might as well just go the full monty and implement cellular automata, the two aren’t really that much different; i think cellular automata are technically more space efficient.
FPGAs aren’t really an answer; they are arrays of one bit adders, as many others before me have pointed out. in other words very stupid, very slow, 2D cellular automata.
so, in some regards i think new fabracation techniques and the old standby-never-quite-implemented-efficiently associative memories and cellular automata will be the “solution”
incidentally, phase change memory is well suited for all of the above things… as well as other existing 3D fab techniques.
I like the Niagra2. The onboard ethernet has a chunk of silicon that lets it queue up events on different cpu’s depending on different filtering criteron, thats extremely smart when you’re using 8 massively wide cores as stream processors/request processors off some 10G feeds. To some degree it turns the computer into a multicomp, and lets each cpu keep its own memory (aka avoid shared memory). Add in a crypto processor capable of saturating the 10G link and you can replace half your networking gear. Going back to the massively wide architecture and using something like HT is very smart as well, basically providing a bunch of functional units and then trying to keep them all fed. If the question is “how do you scale out,” Sun has some very good answers, and most of them appear on die.
Bonus q: how do you feed 8 extra-wide cores? Four dual channel banks of FBDimms. Nice.
The spagetti stack hardware architecture of the Burrough’s matches closely to continuations, which we’re seeing more and more of for implementing asych event handling routines. This would potentially be excellent for dealing with highly concurrent systems.
I’m no expert on the Burroughts, but I dont think the burroughs was particularly a CISC v. RISC fight, no more than the x86’s CS, DS, SS, ES, you could implement most of its architecture in RISC form. Its just that the “segment registers” notion was take na little higher level with the Burroughts, to let the processor know a little more about the stack and to let the stack grow in different ways. The most glaring counterclaim would be having to chase down the stack and decend lexical levels to dig up another variable, but there’s plenty of caching opportunities here, X86 adds its own TLB’s all the time to deal with the same class of problems for segmentation and indirect lookups. Who knows, its probably a terrible idea, I certainly dont know enough about Burroughs like systems to judge, but I think it would be really fascinating to try and make an object oriented cpu or a spagetti stack cpu or something other than the segmented stack machine. (PPC has some interesting potential with its inverted page tables)
Although it would still essentially be a von Neumann computer, I would really love to see a CPU which is a graph reducer and implemented SK combinators.
Has this been done before?
It would need to support GC explicitly, and some other memory and math operations, but implementing a functional language on this sort of architecture would be a breeze, obviously.
My understanding is that when the JVM needs a chunk of blank memory the system carefully loads the previous contents of the memory to cache, which the VM then clobbers. A design of cache/memory architecture that eliminated the redundant read would improve performance.
Not my idea, though I wish I could take credit for it.
It’s called the Reduceron2.
http://www-users.cs.york.ac.uk/~mfn/reduceron2/
“The Reduceron is an extremely simple machine, containing just 4 instructions, and executes core Haskell almost directly.” <- from the website.
Pretty cool, does this meet the criteria for someone (though not me) actually doing something instead of just talking about it?
Of course it does! Reduceron is a pretty cool project.
I’m in the process of chasing the various links people sent and digesting everything. I’ll publish a summary of the interesting stuff when I’m done.
[...] – Alan Kay, leído en The “high-level CPU” challenge [...]
More than 10 years ago, I saw a proposal for a processor with one ALU and 128 program counters and sets of registers. The intention was that it would be exceptionally good at implementing multi-threaded applications. Unfortunately, suitable software was quite lacking in that era. However, the theory is very good.
If you design a processor which is extremely multi-threaded then do not need to implement, deep pipelines, speculative execution, branch prediction, out-of-order execution (and reconciliation) or register shadowing.
High level languages are very problematic for pipelines. This is especially true when running object oriented languages. The problem is that method calls are dependent on the instance being handled. A tree or network of classes have some common methods. Additional methods may be lazily loaded during execution. This arrangement hinders compilation techniques. It also affects pipelining.
A common case is a method calling a method of another class. This process can involve multiple memory accesses. Each method call also requires a jump to an address which is unknown until very briefly before the jump is made. This is problematic for performance if you have a deep pipeline which decodes instructions over many clock cycles. It is possible to defer write operations to increase read throughput. This can reduce immediate impact but it increases complexity and creates additional opportunities for pipeline stalls. Furthermore, it doesn’t change the fact that your ALU is idle between method calls.
Instead, it would be more productive to have a very large number of hardware threads. ALU, registers and instruction set could be designed with minimal complexity. It would be possible to implement eight simple ALUs, each with eight sets of registers. Therefore, you’d have eight threads per ALU. Threads within a group would be allocated processing time with strict preference. Each group is run out of phase.
One or more dedicated instruction caches are not required because instructions can be retreived from secondary cache within eight cycles. If access to main memory takes 20 cycles then a thread will block and two lesser threads within its group can be run. Additional memory accesses will cause additional threads to block. Regardless, throughput from the cache should be optimal and one or more ALUs will be in use almost constantly. However, unlike contemporary designs, this remains true when running OO and bytecode programs.
I’d suggest implementing a two address machine. This allows a rich but compact set of instructions to be loaded, cached, transferred and decoded efficiently. I’d implement a subset of the complexity of a TMS9900 or a MC68000 because it would give the best throughput versus complexity. If you stripped the instructions which required microcode then the remainder would easily complete within eight cycles.
You’d probably want a symmetric design, for simplicity. Regardless, it would be possible to have differing ALUs, registers and instruction sets within one implementation. One or more ALUs could be optimised for DSP operations. One or more ALUs could be optimised for bit operations.
This idea works very efficiently with existing software. It doesn’t require fancy compilers. It doesn’t require exceptional bus bandwidth. I just wish it was mine.
Hyper-threading (lots of reg sets per ALU) has its problems:
1. Some workloads are hard to parallelize
2. Some workloads are ALU-bound
Basically, AFAIK you hit diminishing returns at the reg set #4-#5 - the chances to increase actual throughput aren’t high enough to justify the die size.
Hyperthreading doesn’t remove the cost of HLLs, although it can help fill the stalls in the pipeline, sufficiently reducing this cost; however, it won’t help in the (probably frequent) case when your vtables or whatever it is that causes the indirection are already in the cache (unless the core has a high load-from-cache latency, which is a problem by itself).
I read this article a couple weeks ago, and started writing a response. It grew large enough that I decided to post it as a separate article rather than in this tiny little comment window.
A summary of the proposal is a second TLB intended to be managed directly by application code, allowing cheap copy-on-write mappings for functional language argument passing.
http://codingrelic.geekhold.com/2008/05/high-level-cpu-response.html
You should update your article, because the Lisp type system does not work the way you think it does. SBCL, a commonly-used Lisp system, does refuse code that does not pass its type checking (both static and dynamic) and does compile away the checks on code it knows is safe (unboxing is done as well). As for Lisp ever being uglier than Java: Macros buy me so much in terms of expressiveness Java seems like a bad joke. That, plus the fact Java doesn’t seem to have a workable type system, clinches it for me.
Peter Seibel “should” update his book then. Quote:
“… to make Common Lisp compile add more or less like a C compiler would, you can write it like this:
(defun add (x y)
(declare (optimize (speed 3) (safety 0)))
(declare (fixnum x y))
(the fixnum (+ x y)))
Of course, now the Lisp version suffers from many of the same liabilities as the C version–if the arguments passed aren’t fixnums or if the addition overflows, the result will be mathematically incorrect or worse.”
I guess it has something to do with the (safety 0) thingie. It has the advantage of yielding assembly code identical to that of a statically typed C version, but is apparently even less safe.
Can SBCL be better than what the CL standard apparently mandates implementations to be, specifically, be as fast as C due to explicit typing or type inference without sacrificing safety? I dunno. SBCL shootout benchmarks run ~2x slower than C. And regarding “beauty” compared to C/Java/… - (the fixnum (+ a b)) when a and b are already declared as fixnums is something you wouldn’t have to do in C or Java. SBCL benchmark implementations also have their share of declare and convert calls. Not to mention the (declare (optimize…)) from the book.
That macros buy expressiveness doesn’t contradict anything of what I said. I didn’t diss Lisp, praise Java or claimed Lisp was linguistically inferior to Java (ha!). I claimed the following:
1. Implementing support for Lisp in hardware doesn’t reduce system cost compared to an implementation on a “traditional” machine and an optimizing compiler.
2. Lisp intrinsically costs more than a low-level language like C; neither hardware nor software can hide it completely. It doesn’t mean you should use low-level languages, it means just what I said.
Also, Lisp with all the declarations you’d make in C with all the limitations you’d have in C isn’t Lisp with types, it’s C with sexprs (I’d like to have C with sexprs, but that’s another matter). That is hardly provable/refutable though; it’s a personal preference/perception thing.
P.S. Java has a “workable” type system since quite a bunch of systems were delivered with it. It doesn’t mean I want to work in Java; by the test above, C++ is “workable”, and in case you didn’t notice, I despise C++. If people get work done in it, it’s “workable”. But that’s just a side note.
hey, dont take my money.
Well, I think that Alan Kay and many more are actualy pointing out bad cpu designs.
Two examples are branch prediction and memory caches.
I rather have Broadband Engine Cell type of proccessor where each core has private scratch memory and DMA between those and main memory. Cache coherrency proplem and branch prediction penalty disapear and less silicon real estate is taken up by cache circuitry. Add propper hardware level subroutine handling (a la Forths returnstack) and possibility of fitting more code into constrained scratch memory gives you more functionality per kilobinarybyte. Add register file and rstack DMAing and you got yourself a faster task switching. And if you could add fpga matrix as “one core” on the chip that is an bonus too ;-)
Reminds me of http://www.loper-os.org/
I remember seeing a presentation a while back about a multi-core system that had two stacks per core. and 60 cores (or something like that). It was sort of like Forth in hardware. I can’t remember where I saw it, but it might be on google video or something if you want to search (but I can’t remember the guys name…)
You’re correct that most features of a high level language wouldn’t be accelerated with special hardware. Specifically, type tagging and microcoding are a big lose. As another commenter pointed out, modern processing performance is memory bound, not CPU-bound. If you can keep your computations “on-chip,” then you’re better off with a simple RISC implementing those calculations rather than special-purpose hardware. The only minor win is that RISC might have slightly more icache bandwidth required to fetch the additional instructions. Otherwise, keep the core simpler. This is why tagging isn’t a big deal. Tags can easily be stripped/added/compared while keeping everything in registers.
I’d point out that there is no reason that Lisp can’t run as fast as C, even without type declarations. Modern dynamic compilation can produce type-specific code at runtime that will be highly optimized for the actual types in use. Most function parameters are of the same type on each call can therefore run at full speed. See Andreas Gal’s work on trace compilers, for instance. With multicore, you could even dedicate a core to handling the compilation tasks in parallel, as code continues to run.
Now, all that said, one area where high level languages do burn a lot of cycles is in memory management. In fact, all languages tend to burn a lot of performance in memory management, but HLLs suffer more because they tend to put more in the heap and they are typically garbage collected. The issue of what goes in heap is a compiler problem and I won’t say anything about it. GC, however, can be made to operate in parallel with special hardware. This does not necessarily mean full HW GC. Much of the actual GC algorithm can still be implemented on a simple software co-processor, but the special hardware allows you to intercept memory references issued by the main CPU to keep the memory model coherent (think read and/or write barriers that operate in parallel and simply stall the main CPU for a cycle or two when needed, until the memory state becomes coherent again). This eliminates comparisons and branches associated with every read/write (depending on when you implement the barrier). It also eliminates, or greatly reduces any pauses caused by “stop the world” GC requirements. Things like tracepoints can also be optimized. Finally, a lot of the performance-robbing GC bookkeeping can be reduced/eliminated (think card-marking and back-pointer checking).
I’m still going to handwave here, but it seems like this area would definitely help HLL performance. In the same way that everybody implements the V-to-P translation of modern virtual memory systems in hardware using a TLB, this would speed up memory specifically for the requirements of HLLs.
Finally, I’ll just point you an email conversation between Cliff Click at Azul and Dave Moon and Dan Weinreb, formerly at Symbolics. They discussed the hardware GC support of the Azul hardware Java processor. It definitely seemed like a win over stock RISC.
Sorry, forgot the reference: http://blogs.azulsystems.com/cliff/2008/11/a-brief-conversation-with-david-moon.html
I understand that when reclaiming old memory Java wants to zero it.
And that current CPU’s load the old contents from slow memory into cache just so it can be zero’d.
What about a way to not load current memory contents into cache and just write to it?
I mean, the best of high-level architecture if Intel iMAX-432.
It is too complex, but it can be simplified.
To ‘just write to memory’ wont work. You have to ensure that all processor cores have a consistent view of that memory space first, this means synchronizing caches among other things. If I just zeroed out the main memory, other processors could still be using data in their L2/L3 caches which they think is fresh, not realizing that another processor on the system has declared the area dead. Equally, they could then write their contents back over my newly zeroed area. So given that you have to pause and evaluate the contents of all other core’s caches, you may as well go ahead and just load the area into cache, write it and then use the standard cache consistency algorithms to handle the writeback rather than having to implement and debug a special case.
Personally, I’d just love to see people target the correct architectures when they compile stuff. Sure, i386 and i586 will work on i686, but it also prevents the use of any modern features at all. The move to 64bit is less useful for the expanded address space than it is for doubling the register count (to 16, half of what a POWER core has) and forcing people to target an up-to-date architecture.
If I had my way, the world of processors would be a lot more RISC than currently. I just wish I could afford a Power6 or SPARC system to play with. I’m also keen to see how the hardware scout feature of the SPARC Rock series processors turns out.
I think the problem is that you want to solve the wrong problem. As you said, high levels of indirection are costly, and you want to keep the access of flat memory structures fast.
However, IMHO the image processing application is an extremely specific example. Not very many things are stored in flat structures these days. Even databases are flat only as long as you only don’t want massively parallel access, only want to read them, etc.
The point is: real world data is often (way more often than not) composed of complicated data structures full of indirection. Modern “killer applications” are not about sequential, predictable operations over a structured array; instead, they’re about complex methods of updating these deep and indirect data structures - and indeed, this is the main and natural domain of these HLLs you mention.
We can keep specialized CPUs or architectures for sequential operations (in fact, we already have these; they’re called GPUs); and we should forget about optimizing our main CPUs for a case of flat data.
I want to create HLL implementations that will work efficiently on my customers’ many cores in the years to come but I cannot buy a manycore system today so I cannot test my ideas. I want robustly-deployable binaries of immersive graphical applications but today’s separation between the CPU and GPU requires graphics drivers that are a major weak point in the reliability of the whole system, massively increasing our support costs and even forcing us to abandon product lines.
My proposed solution: an affordable manycore CPU today.
Many cores executing an existing instruction set: can reuse existing OS and compiler technology like Linux and LLVM. Good programmable graphics capability from multithreaded software without an error-prone graphics driver.
On a separate note, concurrent garbage collection is becoming increasingly important and is already a major performance bottleneck. I have no idea how this might be done but perhaps dedicated hardware can help speed this up? Also, work stealing queues (e.g. Cilk, Microsoft Task Parallel Library) are an excellent general-purpose solution for efficient parallelism. Perhaps they could be built into each core in order to lower the overheads?
I agree with author we don’t need new hardware for heigh level languages. What we really need, it is start thinking more efficiency. Good discipline and education could make programmer use low-level or “difficult” languages as they now using heigh level languages.
“If you use Lisp in the Lispy way that makes it so attractive in the first place…”
Then you only need to optimize the tiny part of your program that’s the *bottleneck*. It’s a lot easier to rewrite a function or two from Lisp into (slightly different) Lisp, than it is to write it in C and interface it and link it in (at least, in any HLL I’ve ever used).
““A large project in C would implement the Lisp run time?” Oh really? You mean each variable will have the type LispObject (or PyObject or whatever)?”
No, he means you end up with a half-dozen specialized ad-hoc MAPCAR implementations, and a half-dozen specialized ad-hoc FILTER implementations, and so on. Every large C project I’ve ever seen does this. To counter-challenge: I’d be curious to see one that didn’t.
Your challenge is a good one, if you could constrain it a bit (is there a good simulator or something we should use?). But JWZ didn’t say anything about changing machine architectures. I don’t know why you think his quote is at all related to the others, or to your subject.
As has been said in other comments: primitives that help GC on a RISC core could help. Also, allow cheap thread creation without going through the OS, so it would be feasible to run a sequence of a few hundred instructions on another core.
Performance of dynamic languages depends for a much greater part on smart compilers, jits, and dynamic optimizers that take the actual runtime values of data the program is handling into account. To do that, the program has to be profiled, which on current hardware allways has some performance impact (or at least is a tradeoff between profiling accuracy and performance). How about one core sending hardware generated profiling data to a second core that can do the analysis, so there is no runtime cost to profiling (as long as enough cores are available). If this happens, profiling everything, including production use, could become the norm. This in itself does not speed up anything, but I think it would improve the opportunities for both compilers and developers a lot.
But a lot of the current performance challenges for HLLs probably need to be solved with software (jits etc) instead of hardware.
@Somejan, you say “Also, allow cheap thread creation without going through the OS, so it would be feasible to run a sequence of a few hundred instructions on another core.”
I don’t think this would help much. Thread creation is such an OS-specific, or even language-specific in the case of Erlang, operation that I don’t think you could include it as the level of the CPU.
I’m a believer that GC *assist* is the right answer, but it’s only assist. Different languages and operating systems are going to want to manage GC in a slightly different way. The best you can do is implement some of the more painful primitives and let them use those, as opposed to putting a full GC algorithm in hardware itself.
[...] who believe that a CPU ought to be designed specifically around the needs of high-level languages: Do you think high-level languages would run fast if the stock hardware weren’t “brain-damage… My challenge is this. If you think that you know how hardware and/or compilers should be designed [...]
Leave a Comment