April 21st, 2008 — software
So the other day I both defined Tcl procedures all by myself and used them interactively, and I liked it, to the point where it felt like some kind of local optimum. This entry is an attempt to cope with the trauma of liking Tcl, by means of rationalizing it. I’ll first tell the story of my fascinating adventures, and then I’ll do the rationalizing (to skip the oh-so-personal first part, go straight for the bullet points).
Here’s what I was doing. I have this board. The board has a chip on it. The chip has several processors in it. There’s a processor in there that is a memory-mapped device, and I talk to it using CPU load/store commands. The CPU is itself a JTAG target, and I talk to it via a probe, issuing those load/store commands. To the probe I talk via USB using an ugly console that can’t even handle the clipboard properly. The console speaks Tcl.
Net result: in order to talk to the memory-mapped processor, I have to speak Tcl. Or I can use a memory view window in a graphical debugger. Except some addresses will change the processor state when read (pop an item from a LIFO, that sort of thing). So no, memory view windows aren’t a good idea - you have to aim for the specific address, not shoot at whole address ranges. Damn, just who thought of defining the hardware in such a stupid way? Why, it was me. Little did I know that I was thereby inflicting Tcl on myself.
Anyway, there’s this bug I have to debug now, and as far as I know it could be in at least three different pieces of software or in two different pieces of hardware, and I don’t like any of the 5 options very much, and I want to find out what it is already. So at the ugly probe console I type things like word 0x1f8cc000 (reads the processor status), word 0x1f8cc008 2 (halts the execution), word 0x1f8cc020 0x875 (places a breakpoint). I get sick and tired of this in about 2 minutes, when the command history is large enough to make the probability of hitting Enter at the wrong auto-completed command annoying. It’s annoying to run the wrong command because if I ruin the processor state, it will take me minutes to reproduce that state, because the program reads input via JTAG, which is as slow as it gets.
So I figure this isn’t the last bug I’m gonna deal with this way, and it’s therefore time for Extending my Environment, equipping myself with the Right Tools for the Job, Tailored to my needs, utilizing the Scripting capabilities of the system. I hate that, I really do. Is there something more distressing than the development of development tools for the mass market of a single developer? Can a programmer have a weakness more pathetic than the tendency to solve easy generic meta-problems when the real, specific problems are too hard? Is there software more disgusting in nature than plug-ins and extensions for a butt-ugly base system? But you know what, I really fail to remember 0×1f8cwhat the breakpoint address is. This story has one hexadecimal value too much for my brain. OK, then. Tcl.
I decided to have one entry point procedure, pmem, that would get a memory-mapped processor id, ’cause there are many of them, and then call one of several functions with the right base address, so that pmem 0 pc would do the same as pmem_pc 0x1f8c0000. Well, in Tcl that’s as simple as it gets. Tcl likes to generate and evaluate command strings. More generally, Tcl likes strings. In fact, it likes them more than anything else. In normal programming languages, things are variables and expressions by default. In Tcl, they are strings. abc isn’t a variable reference - it’s a string. $abc is the variable reference. a+b/c isn’t an expression - it’s a string. [expr $a+$b/$c] is the expression. Could you believe that? [expr $a+$b/$c]. Isn’t that ridiculous?
In fact, that was one of my main applications for Tcl: ridiculing it. I remember reading the huge Tcl/Tk book by Brent Welch with my friend once. There was a power outage and it was past the time when the last UPS squeaked its last squeak. And the book was there ’cause the hardware guys use it for scripting their lovecraftian toolchain. We really did have fun. Tears went down my cheeks from laughter. Even people with the usual frightened/mean comments about those geeks who laugh their brains out over a Tcl book didn’t spoil it. So, ridiculing Tcl, my #1 use for it. The other use is the occasional scripting of the hardware hackers’ lovecraftian toolchain. Overall, I don’t use Tcl very much.
The nice thing about Tcl is that it’s still a dynamic language, and reasonably laconic at that, modulo quoting and escaping. So I enter the usual addictive edit/test cycle using tclsh < script. N minutes down the road (I really don’t know what N was), I’ve finished my 2 screenfuls of Tcl and the fun starts. I actually start debugging the goddamn thing.
$ pmem 0 stat
IDLE
$ pmem 0 bkpt 0 0xbff
$ pmem 0 bkpt 1 0xa57
$ pmem 0 cmd run
$ pmem 0 stat
DEBUG
$ pmem 0 pc
0xbff
$ pmem 0 rstack
3 return addresses
addr 0: 0x0005
addr 1: 0x05a8
addr 2: 0x0766
$ pmem 0 cmd stp
$ pmem 0 pc
0xc00
Weeee! HAPPY, HAPPY, JOY, JOY!
You have no idea just how happy this made me. Yeah, I know, I’m overreacting. I’ll tell you what: debug various kinds of hardware malfunction for several months, and you’ll be able to identify with the warped notion of value one gains through such process. On second thought, I don’t know if I’d really recommend it. Remember how I told low-level programming was easy? It is, fundamentally, but there’s this other angle from which it’s quite a filthy endeavor. I promise to blog about it. I owe it to the people who keep telling me “so low-level is easy?” each time they listen to me swear heartily at a degenerate hardware setup where nothing works no matter what you try. I owe it to myself - me wants to reach a closure here. Why should I tolerate being regularly misquoted at the moments of my deepest professional catharsis?
Aaaanyway, in just N minutes, I bootstrapped myself something not unlike a retarded version of gdb, the way it would work if the symbol table of my program was stripped. But no matter - I have addr2line for that. And the nice thing about my retarded debugger front-end is that it looks like shell commands: blah blah blah. As opposed to blah("blah","blah"). And this, folks, is what I think Tcl, being a tool command language, gets right.
I come from the world of pop infix languages (C/Java/Python/Ruby/you name it). Tcl basically freaks me out with its two fundamental choices:
- Tcl likes literals, not variables. Tcl:
string, $var. Pop infix: "string", var.
- Tcl likes commands, not expressions. Tcl:
doit this that, [expr $this+$that]. Pop infix: doit("this","that"), this+that.
So basically, pop infix languages (and I use the term in the most non-judgmental, factual way), pop infix languages are optimized for programming (duh, they are programming languages). Programming is definitions. Define a variable and it will be easy to use it, and computing hairy expressions from variables is also easy. Tcl is optimized for usage. Most of the time, users give simple commands. Command names and literal parameters are easy. If you are a sophisticated user, and you want to do pmem 0 bkpt [expr [pmem 0 pc] + 1], go ahead and do it. A bit ugly, but on the other hand, simple commands are really, really simple.
And eventually, simple commands become all that matters for the user, because the sophisticated user grows personal shortcuts, which abstract away variables and expressions, so you end up with pmem 0 bkpt nextpc or something. Apparently, flat function calls with literal arguments is what interactive program usage is all about.
I’m not saying that I’m going to use Tcl as the extension language of my next self-made lovecraftian toolchain (I was thinking more along the lines of doing that one in D and using D as my scripting language, ’cause it compiles fast enough and it’s apparently high-level enough). I haven’t thought enough about this, but the grotesque escaping/quoting in Tcl still freaks me out; I don’t want to program like that. All I’m saying is that I like the interactive part. Specifically:
- Short code matters a lot; short interactive commands matter much more.
- An interactive command language must be a real language (loops, functions and all).
- Tcl allows for the shortest commands and it’s a real language. I’m fascinated.
Allow me to elaborate.
Short code vs short commands
Lots of people have noticed that keeping your code short is extremely important. More surprisingly, many people fail to notice this, probably because “1 line is better than 5″ doesn’t sound that convincing. OK, think about 100K lines vs 500K and you’ll get the idea. Oh, there are also those dirty Perl/shell one-liners that make one doubt about this. I’ve known a Bastard Programmer that used 2K bash one-liners as his weapon of choice. OK then, so the actual rule must be “short code is good unless it’s written by a bastard”. But it’s the same core idea.
So we have the Architect type, who loves lots of classes which delegate work to each other, and we have the Enlightened type, who wants to write and read less. And the Enlightened type can rant and rave all day how Python, or Ruby, or Lisp make it oh-so-easy to define data structure literals, or to factor out stuff using meta-programming, or some other thing an Architect just never gets. And I’m all with it.
And then we have interactive shells. And in Python it’s doit("xx","yy"). And in Lisp it’s (doit "xx" "yy"), or (doit :xx :yy), or (doit xx yy) if you make it a macro. And in Ruby it’s doit :xx :yy, if you use symbols and omit parens. And that’s about as good as you can get without using your own parser as in doit "xx yy", which can suck in the (more rare) case when you do need to evaluate expressions before passing parameters, and doesn’t completely remove overhead. Also note how all these languages use (), which makes you press Shift, instead of [] which doesn’t. Ruby and Perl let you omit (), but it costs in readability. And [] is unanimously reserved for less important stuff than function calls.
The whole point of short code is saving human bandwidth, which is the single thing in a computing environment that doesn’t obey Moore’s law and doesn’t double once in 18 months. Now, which kind of bandwidth is the most valuable? I’ll tell you which. It’s the interactive command bandwidth. That’s because (1) you interact a lot with your tools and (2) this interaction isn’t what you’re trying to do, it’s how you’re trying to do it, so when it isn’t extremely easy it’s distracting and extremely frustrating.
This is why an editor that doesn’t have short keyboard shortcuts for frequently used commands is a stupid fucking piece of junk and should go down the toilet right now. This is why a Matlab vector - [1 2 3] - is much better than a Python list - [1,2,3] (ever noticed how the space bar is much easier to hit than a comma, you enlightened dynamic language devotee? Size does matter). And don’t get me started about further wrapping the vector literal for Numeric Python.
The small overhead is tolerable, though sucky, when you program, because you write the piece of code once and while you’re doing it, you’re concentrating on the task and its specifics, like the language syntax. When you’re interacting with a command shell though, it’s a big deal. You’re not writing a program - you’re looking at files, or solving equations, or single-stepping a processor. I have a bug, I’m frigging anxious, I gotta GO GO GO as fast as I can to find out what it is already, and you think now is the time to type parens, commas and quotation marks?! Fuck you! By which I mean to say, short code is important, short commands are a must.
Which is why I never got to like IPython or IDLE. Perhaps Ruby could be better, because of omitting parens and all. Ruby seems to be less inflicted with the language lawyer pseudo-right-thing mindset. But the basic plain vanilla function-call-with-literal-args syntax still doesn’t reach the purity of *sh or Tcl. Well, the shell is an insanely defective programming language, so it’s not even an option for anything non-interactive. But Tcl gets way closer to a programming language. Which brings us to the next issue:
Ad-hoc scripting languages - the sub-Turing tar pit
Many debuggers have scripting languages. gdb has one, and Green Hills MULTI has one. Ad hoc command languages usually get the command-syntax-should-be-easy part right - it’s command arg arg arg… They then get everything else wrong. That is, you usually don’t have any or some of: data structures, loops, conditionals and user-defined functions, option for expression evaluation in all contexts, interface to the host OS, and all the stuff which basically would make the thing a programming language. Or you get all those things in a peculiar, defective form which you haven’t seen anywhere else.
I wish people stopped doing that. I understand why many people do that very well - they don’t know any language which isn’t a 3rd generation one (presumably C++ or Java). They don’t know how scripting works except on a theoretical level. They know how to build a big software system, with objects and relationships between objects and factories of objects and stuff. At the system/outside world boundary they’re helpless though. Outside of the system our objects are gone. There’s this cold, windy, cruel world with users and files and stuff. Gotta have an AbstractInputParser to guard the gates into our nice, warm, little system, um, actually it’s “big”, no, make it “huge” system.
These are the Architects who get mocked by the Enlightened dynamic language lovers. They normally dismiss scripting languages as “not serious”, therefore, when faced with the need to create a command language for their system, they start out with a plan to create a non-serious (a.k.a crippled) language. Even if they wanted to make it a good one, they never thought about the considerations that go into making a good scripting language, nor do they realize how easy/beneficial it is to embed an existing one.
So basically we have 3GL people, who realize that commands should be short (”it’s a simple thing we’re doing here”), but they don’t see that you need a real Turing-complete programming language for the complicated cases. And we have 4GL people, who optimize for the complicated case of programming (”what’s a scripting language - it’s a programming language, dammit!”), and they don’t care about an extra paren or quotation mark.
And then we have Tcl, which makes easy things really easy and scales to handle complicated cases (well, almost, or so I think). And not only does it make plain funcalls easy - it reserves [] for nested funcalls, in the Lispy prefix form of outercall arg [innercall arg arg] arg… [] is better than (). Pressing Shift sucks. And custom keyboard mapping which makes it possible to type parens without pressing Shift is complete idiocy, because you won’t be able to work with anyone’s machine. This shit matters, if you program all day long it does.
Now what?
I don’t know if I’d use Tcl. It’s less of a programming language than your typical pop infix 4GL. For starters, [expr] is a bitch. And then there are “advanced” features, like closures, that I think Tcl lacks. It has its interesting side from a “linguistic” perspective though. It has really few core syntax, making it closer to Lisp and Forth than the above-mentioned pop infix ilk. So you can use Tcl and claim for aristocracy. Of course you’ll only manage to annoy the best programmers this way; the mediocre won’t know what you’re talking about, seeing only that Tcl doesn’t look enough like C to be worth the name of a language.
I’d think a lot before embedding Tcl as a scripting language for my tools, because of linguistic issues and marketing issues (you ought to give them something close enough to C, whether they’re a customer or a roommate). So the practical takeaways for me are modest:
- I ain’t gonna mock Tcl-scriptable tools no more. I understand what made the authors choose Tcl, and no, it’s not just a bad habit. On a level, they chose probably the best thing available today in terms of ease-of-programming/ease-of-use trade-off (assuming they bundle an interactive command shell). If I need to automate something related to those tools, I’ll delve into it more happily than previously. So much for emotional self-tuning.
- I’ll let it sink, and try to figure out whether you have a better trade-off. For example, if Ruby had macros (functions which don’t evaluate their inputs), you could say doit x y without making x and y symbol objects, which forces you to prefix them with a colon. How macros of that sort should work in an infix language escapes me (not that I think that much about it, but still). Anyway, I’ll definitely add Tcl to the list of things I should understand better in order to fight my linguistic ignorance. Being an amateur compiler writer, that seems like one of my duties.
April 9th, 2008 — software
Today an important thing happened in my professional life. I was told to take a break from The Top Priority Project, so that I could deal with a more important project. The evaluation of the expression max_priority+1 caused my wetware registers to overflow. Therefore, you should consider the following piece of whining as an attempt of a brain to recover from deadline damage. That is, you won’t get the kind of deep discussions and intellectual insights you’ve come to expect from yosefk.com. Instead, you’ll get shallow, plain, simple and happy Python hate. That’s the kind of thing we have in stock right now.
It has been known for quite some time that a specie called the Idiot is out there, and its revolting specimens will mercilessly prey on your belief in the future of the human race. The time has come for me to share with you some of the shocking experience from my encounters with idiots. Believe it or not, there exists a kind of Idiot who sincerely thinks that a Person should not complain about Technology or Tools he or she gets to use, and that such complains are deeply wrong on this or other level, depending on the exact idiot in question. Well, I hereby inform the idiots, at least the ones who can use a browser and read text, that I (1) have a bad attitude, (2) am not a good workman, so (3) I’ll complain about any technology I use as I damn please and (4) I don’t give a flying fuck about your moronic opinion, so don’t bother to share it with me.
So, Python hate. I don’t really hate Python; it’s a figure of speech. You know what I really hate? Of course you do. I hate C++. C++ is the hate of my life. I don’t think I’ll ever be able to hate another programming language. For me, to hate means to recognize that something is inherently evil and should be exterminated. Reaching that status is outstanding achievement for any technology. Your typical piece of software never gets that far. All it does is it does something you need most of the time, and then does something extremely brain-damaged some of the time. So you kinda like it, but sometimes it just pisses you off. It’s gotta piss you off. It’s a machine. It’s doesn’t have a pigeon crapload worth of common sense. Of course it pisses you off. Don’t lie to me! A person who was exposed to machines and doesn’t hate them is either an idiot or is completely devoid of soul! Step back, the child of Satan!
You know what really pisses me off about Python? It’s the combination of being BDFL-managed and having roots in the noble idea of being The Language for Teaching. Sure, having lexical scoping wouldn’t hurt, and having name errors 5 minutes down the road in a program that happily parses already hurts, and Python shells aren’t a picnic, blah blah blah. But nothing is perfect, you know, and people adapt to anything. I learned to live in tcsh and vim as my IDE. So I know how adaptability can bring you quite far, sometimes much farther than you wish to get. But this BDFL+teaching combo really bugs the hell out of me. Allow me to elaborate.
Ever heard about programming languages with an ANSI standard? Sure you did. Well, the other kind of languages have a BDFL standard. It’s when a Benevolent Dictator For Life, typically the Punk Compiler Writer (PCW) who invented the language and threw together the initial (and frequently still the only practical) implementation, decides what happens with the language. I plan to blog about PCWs at some point, but this isn’t about PCWs in general, this is about PCWs who’ve been elevated to the BDFL rank. So they bend the language as they damn please. Sometimes it splits the user community (as in Perl 5 vs Perl 6) and sometimes it doesn’t (as in Python 2 and Python 3, or so it seems). I’d say that it’s totally stupid to use a BDFL-governed language, but I can’t, because that would offend C++, The Hate Of My Life, who does have an ANSI standard. Relax, darling. It’s you, and only you, that I truly hate. The others mean nothing to me.
So that’s what the “BDFL” part means. The “teaching” part is about Python being supposed to be a good (the best? of course, the best) language for teaching kids how to program. While I’m not sure whether the BDFL part is inherently wrong, the teaching part is sure as hell wrong, in two ways:
- Why the fuck should kids program?
- Why the fuck should I program in a language for kids?
I didn’t program till the age of 17, and I have absolutely no regrets about it. I know quite some other programmers-who’re-really-into-software-someone-call-for-help who didn’t hack in their diapers, either. I also know a bunch of people who did program since they were 10 to 12 years old. They are all burnt out. They’re trying to look for some other occupation. Theater, psychology, physics, philosophy, you name it. I haven’t gathered that many data points, just a couple dozens. Some people won’t fit the pattern I see. But it’s enough for me to assume that kids could have better things to do than programming. Watching cartoons is one thing that sounds like fun. That I always liked.
I’m not sure kids shouldn’t program. We need scientific data on that one. I’m no specialist, but locking a large enough set of kids in the basement and have them implement progressive radiosity sounds like a good start. Anyway, as you probably noticed, while I’m curious about the scientific truth, I don’t care much about kids. I care about me.
Me. I’m a professional programmer. By which I mean to say, I shovel through piles of virtual shit for a living. And to shovel quickly and happily, I need a big shovel. Python is one of my shovels. Core dumps are one of my piles of shit. Python, meet the core dumps. Core dumps, meet the Python.
Core dumps are spelled in hexadecimal. I mean “core dumps”, “spelled” and “hexadecimal” in the broadest sense, as in “things having to do with machine encoding and that sort of low-level crud are viewed and edited in tools that tend to emit and consume integers in hexadecimal notation”. Or “core dumps are spelled in hexadecimal” for short.
To deal with hexadecimal, Python has a hex() function and a ‘%s’ format string like it should. Well, almost. Python thinks that hexadecimal is like decimal, except in base 16 instead of 10. Integers are naturally spelled in sign magnitude format: 1, 10, -1, -10. In hexadecimal, that’s 0x1, 0xa, -0x1, -0xa. "-0x1"? "-0x1"?!! FUCK YOU, PYTHON!!
You see, all computers that have been out there since the year 19-something store integers in 2’s complement format. There used to be 1’s complement and sign magnitude machines, but they are all dead, which is a good thing, because 2’s complement is a good thing. In 2’s complement, -1 in hexadecimal is 0xffffffff. If you use 32-bit integers. And twice the number of f’s on 64-bit machines. Do I give a flying fuck about 64-bit machines? I think you know the answer. If I cared, I’d run Python on a 64-bit host. An old version of Python.
In old versions of Python, hex() worked about right. Python has ints and longs (that’s fixnum and bignum in Lisp, and it would be int and BigInteger in Java if ints didn’t overflow but would instead turn into BigIntegers when int wouldn’t have enough bits). hex() assumed that if you’re using ints, you’re a bit biter, and you want to get 0xffffffff for -1, and you should get your non-portable, but handy result. If, on the other hand, you used longs, you probably were a mathematician, so you’d get -0x1. And this was juuuust riiiight.
However, starting from version 2.something, hex() was “fixed” to always do “The Right Thing” - return the portable sign magnitude string. That, my friends, is what a kid would expect. You should know - you were a kid yourself, didn’t you expect just that?! Naturally, breaking the frigging backwards compatibility is perfectly OK with the PCW BDFL, who sure as hell doesn’t want to add another function, sex() for “sexadecimal“. That function could (preferably) do the new thing, and hex would do the old thing to not break the existing programs. Or it could (passably) do the old thing so that people could at least get the old functionality easily, even though hex() would no longer work. But noooo, we need just one hex function doing the one right thing. Too bad we only found out what that right thing was after all those years went by, but in retrospect, it’s obvious that nobody needs what we’ve been doing.
Now, could anybody explain me the value of hexadecimal notation outside of the exciting world of low-level bit fucking? Why on Earth would you want to convert your numbers to a different base? One particular case of this base brain rot is teaching kids to convert between bases. As Richard Feynman has pointed out, and he had the authority to say so and I don’t and hence the reference, converting between numbers is completely pointless, and teaching kids how to do this is a waste of time. So what you do is you give me, the adult programmer with my adult core dump problems, a toy that a kid wouldn’t need or want to play with. Thank you! You’ve found the perfect time, because I have a DEADLINE and this fucking shit CRASHES and I couldn’t possibly have BETTER THINGS TO DO than WASTING TIME ON CONVERTING BASES!!
I know that I can implement the right (yes, THE RIGHT, damn it) hex and put it into my .pythonrc and that would solve the interactive shell annoyances and I could carry it around and import it and that would solve the non-interactive annoyances, thanks again. Until now I’ve done it in a cheaper way though - I had a Python 2.3.5 binary handy, and that did the trick. 2 Pythons, one with new libraries and shiny metaprogramming crap and stuff, and one with a working hex(). I like snakes.
Why do I attribute this business to kid-teaching, of all things? I don’t lurk on Python forums, so I’m basing this on rumors I heard from die-hard Python weenies. Another, TOTALLY INFURIATING thing: a/b and a//b. int/int yields a frigging FLOAT in Python 3!! This has all the features of hex(-1) == ‘-0×1′:
- “Kids” get the wrong answer: 3/2 should give 1 (honest-to-God integer division) or it should give the ratio 3/2 (honest-to-God rational/real number division). Floating point is reeeaaally too complicated. I’ve seen a teacher saying online something along the lines of “if you ask how much is 3/2 times 2, how do you grade the answer 2.99999″? Bonus points if your reply managed to take the identity “2.999… = 3″ into account (hordes of grown-ups fail to get that one).
- “Programmers” get the wrong answer: chunks=len(arr)/n; rem=len(arr)%n. ’nuff said.
- “Mathematicians” get the wrong answer: I mean, if you’re doing numeric work in Python (what’s the matter, can’t pay for a Matlab license?), you probably know about integer division versus floating point division, but you’ve never heard about an operation called // (no, it’s not a BCPL comment). 3/2=1.5? How am I supposed to do integer division? int(3/2)? Argh!
You know, I’m actually more willing to upgrade to C++0x, whenever that materializes, than I am willing to upgrade to Python 3. At least it would be “for(auto p=m.begin()…” instead of “for(std::map<std::string,std::string>::const_iterator p=FUCK OFF AND DIE YOU EVIL PROGRAMMING LANGUAGE FROM HELL!!”.
What’s that? Who said I liked C++0x more than I liked Python 3?! Upgrade, I said, “willing to upgrade”! Don’t you see it’s relative to the previous version? In absolute terms, who can beat C++?! Come back, C++! I still hate you!!
March 21st, 2008 — software
My programming personality is very depressed right now. For a while, I’ve been away from my usual long-term stuff and back in the trenches. I’m doing simple things, but there’s lots of them, so it feels like I’m not moving anywhere even though I go as fast as I can. It’s like a highway with monotonic view.
Well, this has completely drained my brain. So I won’t tell you anything in this blog - instead, I’ll ask you a question. This way, it’s OK for me to be wrong, which is even more likely than usual in my current state. Clever trick, isn’t it?
My question basically is, how is not having side effects in your programming language going to help you write correct, and optimal, and parallel programs? I’ve already mentioned it, but this time, I’ll elaborate on the point that I fail to get.
The problem with parallelism is scheduling. We want scheduling to be:
- Correct: when A depends on B, B should happen before A.
- Optimal: A, B and everything else should complete as soon as possible.
So, in order to correctly schedule basic blocks of code called A and B, I need to know if they depend on each other. If I have side effects, this is going to be hard. A modifies an object called X, and B modifies Y. Can X and Y be the same thing? To figure it out, you need whole-program analysis, and if your language is Turing-complete, WPA won’t necessarily succeed, either.
This is the aliasing problem. The aliasing problem rules. It’s the problem that forces you to use restrict to optimize numeric code. It’s the problem that forces you to use static typing if you want to optimize anything at all. And it’s the problem that makes all safe languages completely and wildly unsafe when you have multiple threads. The rule of thumb is: if you have a good idea for compile-time or run-time anything, it won’t work because of the aliasing problem.
Now, suppose we got rid of those pesky side effects. A doesn’t modify X; instead, it creates a new object, Z. Similarly, B takes Y and makes W. Now we can prove that A doesn’t depend on B. As soon as their inputs are ready, they can run. This takes care of the correctness problem, which was listed as the scheduling problem #1 at the beginning.
However, this didn’t solve the aliasing problem - X and Y can still be the same. I suspect that the aliasing problem is going to raise its ugly head and kick our pretty butt. Specifically, while semantically we made Z from X, we’d really like to wipe out X and reuse its memory for hosting Z. Because that would be more efficient than copying X and modifying some of it and keeping two objects around. Now, we can’t do this optimization if X and Y are the same thing, and we know nothing about the execution order of A and B, can we? If A replaced X with Z, B would get the wrong Y.
And if we copy things, we aren’t going to get an optimal scheduling, because A and B now run slower - problem #2, right?
Seriously, what do I miss? I realize that there are parallel execution environments where you have to copy things anyway because you have a zillion processors and a zillion processors can’t have shared memory. I also realize that you can have few objects shared between tasks that run in parallel, so the overhead of copying them isn’t very important. I realize all kinds of things making the aliasing problem a non-problem for side-effect-free parallelization frameworks or languages or whatever.
What I fail to see is, if you run in a single-box, multi-core system, and you do have shared memory, and you do have lots of shared data, how does removing side effects help you? With side effects, you can get an efficient program, if you can handle the debugging. Without side effects, you get a correct program, and then you must somehow optimize object allocation. And I just don’t see how you’d go about that without compromising correctness.
Maybe I’m ignorant. Maybe I should have read about this somewhere. But where do I find a levelheaded discussion of, say, common implementation techniques of pure functional language compilers? I just keep bumping into “sufficiently smart compiler” arguments, and then there are people who don’t care much about performance (and shouldn’t because of their application domain).
Compare this to imperative languages. Anybody can write a pretty decent C compiler. If you don’t optimize at all, you get what, 4x slowdown compared to the state of the art optimizer? And on a scalar RISC machine, I bet you can bridge 50%-70% of that gap by your cheapest shot at register allocation. A no-brainer, you need about 30% of a comp sci degree and you’re all set. Now, how am I supposed to optimize allocation if there are no side effects? I mean, your spec or your family of specs should come with implementation techniques. “Useful” is good by itself, but it must also be implementable.
By the way, I’d love to find out that I’m wrong on this one. I’ve done lots of parallelism-related debugging, and it sucks. And maybe I am wrong. Maybe there just isn’t much talk about the implementation details of these things. I do think my reasoning makes sense though.
February 28th, 2008 — hardware
I’m going to argue that high-performance chip designs ought to use a (relatively) modest number of (relatively) strong cores. This might seem obvious. However, enough money is spent on developing the other kinds of chips to make the topic interesting, at least to me.
I must say that I understand everyone throwing millions of dollars at hardware which isn’t your classic multi-core design. I have an intimate relationship with multi-core chips, and we definitely hate each other. I think that multi-core chips are inherently the least productive programming environment available. Here’s why.
Our contestants are:
- single-box, single-core
- single-box, multi-core
- multi-box
With just one core, you aren’t going to parallelize the execution of anything in your app just to gain performance, because you won’t gain any. You’ll only run things in parallel if there’s a functional need to do so. So you won’t have that much parallelism. Which is good, because you won’t have synchronization problems.
If you’re performance-hungry enough to need many boxes, you’re of course going to parallelize everything, but you’ll have to solve your synchronization problems explicitly and gracefully, because you don’t have shared memory. There’s no way to have a bug where an object happens to be accessed from two places in parallel without synchronization. You can only play with data that you’ve computed yourself, or that someone else decided to send you.
If you need performance, but for some reason can’t afford multiple boxes (you run on someone’s desktop or an embedded device), you’ll have to settle for multiple cores. Quite likely, you’re going to try to squeeze every cycle out of the dreaded device you have to live with just because you couldn’t afford more processing power. This means that you can’t afford message passing or a side-effect-free environment, and you’ll have to use shared memory.
I’m not sure about there being an inherent performance impact to message passing or to having no side effects. If I try to imagine a large system with massive data structures implemented without side effects, it looks like you have to create copies of objects at the logical level. Of course, these copies can then be optimized out by the implementation; I just think that some of the copies will in fact be implemented straight-forwardly in practice.
I could be wrong, and would be truly happy if someone explained to me why. I mean, having no side effects helps analyze the data flow, but the language is still Turing-complete, so you don’t always know when an object is no longer needed, right? So sometimes you have to make a new object and keep the old copy around, just in case someone needs it, right? What’s wrong with this reasoning? Anyway, today I’ll assume that you’re forced to use mutable shared memory in multi-core systems for performance reasons, and leave this no-side-effects business for now.
Summary: multiple cores is for performance-hungry people without a budget for buying computational power. So they end up with awful synchronization problems due to shared memory mismanagement, which is even uglier than normal memory mismanagement, like leaks or dangling references.
Memory mismanagement kills productivity. Maybe you disagree; I won’t try to convince you now, because, as you might have noticed, I’m desperately trying to stay on topic here. And the topic was that multi-core is an awful environment, so it’s natural for people to try to develop a better alternative.
Since multi-core chips are built for anal-retentive performance weenies without a budget, the alternative should also be a high-performance, cheap system. Since the clock frequency doesn’t double as happily as it used to these days, the performance must come from parallelism of some sort. However, we want to remove the part where we have independent threads accessing shared memory. What we can do is two things:
- Use one huge processor.
- Use many tiny processors.
What does processor “size” have to do with anything? There are basically two ways of avoiding synchronization problems. The problems come from many processors accessing shared memory. The huge processor option doesn’t have many processors; the tiny processors option doesn’t have shared memory.
The huge processor would run one thread of instructions. To compensate for having just one processor, each instruction would process a huge amount of data, providing the parallelism. Basically we’d have a SIMD VLIW array, except it would be much much wider/deeper than stuff like AltiVec, SSE or C6000.
The tiny processors would talk to their neighbor tiny processors using tiny FIFOs or some other kind of message passing. We use FIFOs to eliminate shared memory. We make the processors tiny because large processors are worthless if they can’t process large amounts of data, and large amounts of data mean lots of bandwidth, and lots of bandwidth means memory, and we don’t want memory. The advantage over the SIMD VLIW monster is that you run many different threads, which gives more flexibility.
So it’s either huge or tiny processors. I’m not going to name any particular architecture, but there were and still are companies working on such things, both start-ups and seasoned vendors. What I’m claiming is that these options provide less performance per square millimeter compared to a multi-core chip. So they can’t beat multi-core in the anal-retentive performance-hungry market. Multiple cores and the related memory mismanagement problems are here to stay.
What I’m basically saying is, for every kind of workload, there exists an optimal processor size. (Which finally gets me to the point of this whole thing.) If you put too much stuff into one processor, you won’t really be able to use that stuff. If you don’t put enough into it, you don’t justify the overhead of creating a processor in the first place.
When I think about it, there seems to be no way to define a “processor” in a “universal” way; a processor could be anything, really. Being the die-hard von-Neumann machine devotee that I am, I define a processor as follows:
- It reads, decodes and executes an instruction stream (a “thread”)
- It reads and writes memory (internal and possibly external)
This definition ignores at least two interesting things: that the human brain doesn’t work that way, and that you can have hyper-threaded processors. I’ll ignore both for now, although I might come back to the second thing some time.
Now, you can play with the “size” of the processor - its instructions can process tiny or huge amounts of data; the local memory/cache size can also vary. However, having an instruction processing kilobytes of data only pays off if you can normally give the processor that much data to multiply. Otherwise, it’s just wasted hardware.
In a typical actually interesting app, there aren’t that many places where you need to multiply a zillion adjacent numbers at the same cycle. Sure, your app does need to multiply a zillion numbers per second. But you can rarely arrange the computations in a way meeting the time and location constraints imposed by having just one thread.
I’d guess that people who care about, say, running a website back-end efficiently know exactly what I mean; their data is all intertwined and messy, so SIMD never works for them. However, people working on number crunching generally tend to underestimate the problem. The abstract mental model of their program is usually much more regular and simple in terms of control flow and data access patterns than the actual code.
For example, when you’re doing white board run time estimations, you might think of lots of small pieces of data as one big one. It’s not at all the same; if you try to convince a huge SIMD array that N small pieces of data are in fact one big one, you’ll get the idea.
For many apps, and I won’t say “most” because I’ve never counted, but for many apps, data parallelism can only get you that much performance; you’ll need task parallelism to get the rest. “Task parallelism” is when you have many processors running many threads, doing unrelated things.
One instruction stream is not enough, unless your app is really (and not theoretically) very simple and regular. If you have one huge processor, most of it will remain idle most of the time, so you will have wasted space in your chip.
Having ruled out one huge processor, we’re left with the other extreme - lots of tiny ones. I think that this can be shown to be inefficient in a relatively intuitive way.
Each time you add a “processor” to your system, you basically add overhead. Reading and decoding instructions and storing intermediate results to local memory is basically overhead. What you really want is to compute, and a processor necessarily includes quite some logic for dispatching computations.
What this means is that if you add a processor, you want it to have enough “meat” for it to be worth the trouble. Lots of tiny processors is like lots of managers, each managing too few employees. I think this argument is fairly intuitive, at least I find it easy to come up with a dumb real-world metaphor for it. The huge processor suffering from “lack of regularity” problems is a bit harder to treat this way.
Since a processor has an “optimal size”, it also has an optimal level of performance. If you want more performance, the optimal way to get it is to add more processors of the same size. And there you have it - your standard, boring multi-core design.
Now, I bet this would be more interesting if I could give figures for this optimal size. I could of course shamelessly weasel out of this one - the optimal size depends on lots of things, for example:
- Application domain. x86 runs Perl; C6000 runs FFT. So x86 has speculative execution, and C6000 has VLIW. (It turns out that I use the name “Perl” to refer to all code dealing with messy data structures, although Python, Firefox and Excel probably aren’t any different. The reason must be that I think of “irregular” code in general and Perl in particular as a complicated phenomenon, and a necessary evil).
- The cost of extra performance. Will your customer pay extra 80% for extra 20% of performance? For an x86-based system, the answer is more likely to be “yes” than for a C6000-based system. If the answer is “yes”, adding hardware for optimizing rare use cases is a good idea.
I could go on and on. However, just for the fun of it, I decided to share my current magic constants with you. In fact there aren’t many of them - I think that you can use at most 8-16 of everything. That is:
- At most 8-16 vector elements for SIMD operations
- At most 8-16 units working in parallel for VLIW machines
- At most 8-16 processors per external memory module
Also, 8-16 of any of these is a lot. Many good systems will use, say, 2-4 because their application domain is more like Perl than FFT in some respect, so there’s no chance of actually utilizing that many resources in parallel.
I have no evidence that the physical constants of the universe make my magic constants universally true. If you know about a great chip that breaks these “rules”, it would be interesting to hear about it.
February 24th, 2008 — software
Yeah, naming conventions. Looks like my brain won’t do any better today; those 5 drafts will have to wait. If you aren’t in a mood for a trivial subject, skip this.
I think that the best naming convention out there is the Lisp one: case-insensitive-dash-separated. It just doesn’t get any better:
- You never have to hit Shift or Caps Lock in the middle of a name, which makes typing easier. This is especially important for names because you use auto-completion with names. Auto-completion requires you to press things like Ctrl-Space or Alt-Slash or Ctrl-P. Together with another Shift needed for the name to be completed correctly, auto-completion is much more likely to cause repetitive strain injury.
- You never have to think about the case. Figuring out the case of a letter in a case-sensitive naming convention can be non-trivial; more on that later.
- The dash-as-a-separator-convention is used in English, so your names look natural.
Unfortunately, most languages use C-style identifiers for names, the dreaded [A-Za-z_][A-Za-z_0-9]*, because their infix parsers can’t tell a dash from a minus. So you can’t use this convention.
This leads to two problems:
- How do we separate between subsequent words in an identifier?
- When do we capitalize letters?
Of course, we could use a lowercase_underscore_separated convention. It would solve both problems in a simple way, having all the benefits of the Lisp convention except for the no-Shifts-and-Caps-Locks part. But (1) Caps Lock is available for capital letters, but not for underscore, and Shift is less healthy for your hands, and (2) if we have case sensitivity in our language, we’ll of course use it, won’t we? OK, let’s kill those underscores.
There are two anti-underscore schools: alllowercase and CamelCase. alllowercase looks lame - it makes it easy to know when to capitalize letters (never), but chooses to ignore the word separation problem completely. I used to sneer at it. However, it has two huge benefits: it’s very typing-friendly, and it discourages the use of long names. Long names, people, are a frigging nightmare.
HaveYouEverSeenANameTakingHalfAScreen? This is awful. Awful!! I can’t lock my eyes on the damned thing. I can only focus on a tiny part of it. My eyes nervously jump around the line, which mentions the moronic mega-identifier twice (at both parts of an assignment). I’m looking for differences, small differences in the names… You know, it could be BlahBlahABlah on the left and BlahBlahBBlah on the right… AAAARGH!
Reading this kind of code is pure mental pain. I prefer mental pain to physical pain on any day, and that’s why I’m in the software industry, but still, this sucks. The good news are that alllowercasenametakinghalfascreen is so ridiculous that even the most clueless pseudo-orderly person won’t emit it.
Now, CamelCase, which is basically the winner, because it’s used in all major languages and libraries, is probably the worst possible naming convention. It fails to solve both problems created by the lack of a good word separator in the A-ZA-Z0-9 languages:
- You don’t really know when one word ends and the next word starts.
- You don’t really know when a letter should be capitalized.
The problems of camel case come from using capital letters for word separation. This interferes with the other uses of case in natural language. The problems are amplified by the brilliant idea to assign even more semantical payload to case: functionsLookLikeThis, but ClassesLookLikeThis, etc. Let’s look at some examples.
English has words like TCP, DNA and WTF. Should a TCP socket class be called TCPSocket or TCPsocket? What about a TCPIPSocket? What if we need a tcpOpen method - should we call it TCPOpen, like a class, to preserve the natural case of an acronym, or should it be TCPopen, so that the lowercase “o” conveys the fact that it’s a function?
Oh, I know, it should be “openTCP”! No, no, why are you using “openTcp” - this is ugliness for its own sake! The only important thing is to get the first letter of a name right, and then you can use natural capitalization! Unless, of course, it’s “openTCPIPSocket”, and then we have a problem again. “openTcpIpSocket”?.. Some people just can’t handle it and resort to underscores: openTCP_IPsocket, open_TCP_IP_socket… It’s no use. It’s ugly no matter what you do.
Capital letters coming from the natural language, like those in acronyms and names of people, are the smaller part of the problem - Tcp looks ugly, but you know what it means. The other part of the problem is the capital letters coming from formal languages, such as mathematical notation.
For example, in computer vision it’s common to denote 3D coordinates with uppercase X,Y,Z, and 2D coordinates with lowercase x,y. In a case-sensitive language, it’s damn natural to follow this convention, and it works very well for a local variable X or x (including the case when you use both in the same function). It doesn’t work so well when you try to name functions or classes after their arguments/coordinate systems.
Does xySomething start with a lowercase x because it’s a function, or because it really accepts x values of 2D coordinates? What about xYSomething - is the Y capitalized because “y” is a word and we always capitalize the first letter of a word, or maybe the function expects Y values of 3D coordinates?
You can have a function working with 3D X coordinates and 2D y coordinates, you know. I think it’s better to call it XySomething than xYSomething, because meaning is more important than convention. But did the author of the function think so, too? Of course, we can use an underscore to “clarify” the intent: something_Xy. The underscore clearly shows that the part after the underscore doesn’t follow standard naming conventions, so it must be according to the computer-vision-specific convention.
So what happens is that CamelCase code deteriorates to the following state:
- You have ugly names like tcpIpOpen.
- Since you also have names like TCP_IP_Open, your real naming convention is “camel case with underscores”. Which is equivalent to “any identifier that compiles”.
Maybe there’s a good way to augment CamelCase with rules that make it work well. I probably wouldn’t know. I ought to say that I’m not that good at naming conventions in particular and in Best Practices in general. But I doubt there’s a good case-sensitive naming convention out there.
Just look at the Python naming convention. You basically have everything. thingslikethis, ThingsLikeThis, things_like_this, thingsLikeThis, and they’re all attached to different types of object (module, class, function, method). And every time your language entity convention disagrees with the common sense (class TCPIPSocket), you’ve got yourself an ugly name. And in a way, this is a good convention, because it at least tries to be consistent with the common conventions used in C, C++ and Java.
The annoying part of this is the slowdown. “Um, how should I spell this name?..” There are actual capitalization trade-offs here. Programming is almost exclusively about making decisions and choosing trade-offs. It’s quite tiring, really. Nobody wants to be making some more pointless decisions on the way just for the fun of it. Maybe it’s just me and the kind of people I’ve worked with, but I’ve always, always bumped into lots and lots of names which looked like a compromise. Somebody was thinking hard here. And it looks ugly anyway.
Barf.
February 16th, 2008 — software
“Are code and data the same thing?” I haven’t conducted a poll, but I think the following answers are the most popular ones:
- “What?” (I don’t know about universal Turing machines)
- “Sure!” (I know about universal Turing machines)
My answer is “No”. I’ll now try to explain briefly why my opinion makes sense in general. After that, I plan to get to the point, which is how the code/data distinction matters in interactive programming environments.
I think that, um, everything is data, at least everything I can think about. I mean, the only things that can technically enter my brain are either data or cigarette smoke, because I don’t do drugs. And I hope that the effect of passive smoking is negligible, so it’s just data.
In particular, code is data. But not all data is code. Code is a special kind of data, that looks like this:
- There are lots of blocks.
- Each block defines the value of something.
- The blocks depend on each other, and the dependencies can be cyclic.
What this means, and of course everybody knows it, is that you can’t make any sense of code in the general case. That is, the only way to compute the values defined by the blocks of code is to “run” the code - keep chasing the links between the blocks, computing the values they define as you go. You can’t even prove that this process will terminate given an arbitrary bulk of code, not to mention proving its correctness.
Now, an image, for example, isn’t code. Well, philosophically, it is, because if they show you an image and it’s really ugly, you’ll say “ewww”. So the image was in fact a program giving instructions to your brain. The output of your brain’s image compiler is the following code in the human body assembly language:
MOV R0, dev_mouth
MOV R1, disgust_string
JMP write
RET
disgust_string:
.asciz "ewww"
More interestingly, you can write a program that processes images, and this particular image may be the one that makes your program so confused that it never terminates. However, this doesn’t mean that the image itself is “code”. The image doesn’t have interconnected blocks defining values. Even if the image is a screenshot of code.
An image is a two-dimensional array of pixels, a nice, regular data structure. You don’t have to “run” it in order to do useful things with it, like computing its derivatives or e-mailing it to your friends so they’ll go “ewww”. And programs doing that can be proven to terminate, unless you have an infinitely slow connection to the outgoing mail server.
So what I’m saying is, code is a special kind of data, having blocks which define values and depend on each other. Does it really matter whether a particular piece of data is “code” according to this definition? I think it does. One reason is the above-mentioned fact that you can’t really make sense of code. Many people realize the practical drawbacks of this, and so in many contexts, they use data-driven programming instead of the arguably more natural “code-driven” programming.
Everything you represent as “data” can be processed by many different programs, which is good. Everything you represent as “code” can only be processed by a kind of interpreter, which is bad. I’m not talking about the difficulty of parsing the syntax, which doesn’t exist with Lisp or Forth, isn’t a very big deal with C or Java and is a full-featured nightmare with C++ or Perl. I’m talking about the semantics - for many purposes, you can’t really “understand” what you’ve parsed without running it, and this is common to all Turing-complete languages.
But this isn’t going to be about the inability to analyze code. This is going to be about the somewhat more basic problem with code - that of blocks which point to each other. In order to explain what I mean, I’ll use the example of 3 interactive programming environments - Matlab, Unix shells and Python, listed in decreasing order of quality (as interactive environments, not programming languages).
Interactive programming is the kind of programming where the stuff you define is kept around without much effort on your behalf. The other kind of programming is when you compile and run your code and it computes things and exits and they are gone. Clearly interactive programming is nicer, because it makes looking at data and trying out code on it easy.
Or so it should be; in practice, it looks like more people prefer “batch programming”, so there might be some drawbacks in the actual interactive environments out there. What makes for a good interactive environment, and what spoils the fun? Let’s look at some well-known gotchas with existing environments.
Some of the most upset people I’ve seen near computers were the ones that had a Matlab session running for months when their machine crashed. It turned out that they had a load of data there - measurements, results of heavy computations, symbolic equations and their solutions - and now it’s all gone. GAAA!! This doesn’t happen with batch programming that much, because you send the output of programs to persistent storage.
This problem, nasty as it may be, looks easy to fix - just have the system periodically save the workspace in the background. Perhaps Matlab already has this. I wouldn’t know, because I tend to manually save things once in a few minutes, since my childhood trauma of losing a file I loved. Anyway, this doesn’t look like an inherent problem of interactive computing, just an awfully common implementation problem. For example, do Unix shells, by default, save the command history of each separate concurrent session you run? I think you know the answer.
Speaking of Unix shells. Ever had the pleasure of typing “rm -rf *” in the wrong directory because of command completion from history? GAAA!! OK. Ought to calm down. Let’s do Fault Analysis. Why did this happen? The command string with “rm” in it is, basically, code; shell code invokes processes. This code depends on another piece of code, the one that determines the current directory. The command string doesn’t have a fixed meaning - you must run getcwd in order to figure it out.
The shell couldn’t really warn us about the problem, either. That’s because the meaning of “rm” is defined by the code at /bin/rm (or by some other program in the $PATH which happens to be called “rm”). Since the shell can’t understand that code without running it, it doesn’t have an estimation of the potential danger. And if the shell warned us about all commands completed from history that originally ran in a different directory than the current one, the completion would likely be more annoying than useful.
At some point I’ve got fed up with Unix shells, and attempted to switch to a Python shell. I tried IPython and pysh, and I still use IDLE at home on my XP box. I ought to say that Python shells suck, and I don’t just mean “suck as a replacement for a Unix shell”, but also “suck as a way to develop Python code”. The single biggest problem is that when you change your code, you must reload modules. It’s unclear which modules should be reloaded, there’s no way to just reload everything, and ultimately you end up with a mix of old code and new code, which does something, but you aren’t quite sure what exactly.
Die-hard Pythonistas refuse to acknowledge there’s a problem, though they do bend their programming style to work around it. What they do is they write all of their code in one file, and use execfile instead of import to make sure everything is indeed redefined, the Zen of Python with its love of namespaces be damned. Sure, an interesting project in Python can be just 5000 lines worth of code, but I don’t like to navigate in a file that big. And sometimes you do need more lines, you know.
Another thing they do is implement __repr__ in their classes so that print displays their objects, and they’ll invest a lot of effort into making eval(repr(obj)) work. The fact that eval’able strings aren’t necessarily the most readable way to produce debug prints doesn’t seem to bother them. Nor do the contortions they have to go through to solve the prosaic problem of making references to other objects display reasonably. One way to do it is to use dictionary keys instead of pointers, so that member object references aren’t expanded into a full object description when they are printed. If you don’t know why they’re doing this, you’ll find their code fairly puzzling.
I find the struggle to make interactive Python programming work very depressing. It reminds me of me, before the invincible idiocy of C++ crushed my spirit. People have a tendency to assume that things are well thought-out and hence should work.
We have this extremely widespread language called C++, and it’s centered around type-based static binding. And it’s easy to see how this could help a compiler spot errors and optimize the code. Therefore, this programming style can be a good way of writing software, if applied consistently. Ha!
We have this Python language, and several shells for it. Quite obviously, interactive programming is a good way to speed up the development cycle. Therefore, adapting our Python code for interactive programming will pay off, if we do it consistently. Ha!
But I digress. This isn’t about the trusting nature of software developers, nor is it a comparison between C++ and Python, mind you. They’re hard to compare, since they are very different beasts: Python is a programming language, and C++ is a karmic punishment. So I should get back to the topic of interactive programming.
Here’s my opinion on the example programming environments I used in this entry.
Matlab is a great one, unless you lose your workspace. I used it for a while several times and it just never itched, and nothing went wrong.
Unix shells are good in terms of their ability to preserve your data (everything is a flat, self-contained string of bytes). I’d love them if they didn’t suck so badly as programming languages. Since they do, I only use shell scripting for one-shot throwaway things, like debugging (fiddling with log files and core dumps).
Python is awful. So when I’m on Unix, I run Python processes from the shell, and sometimes use Python’s reflection to make my batch programming just a bit more interactive. For example, if you have a Python function f(a,b,c), you can have your command line parser introspecting its arguments and adding the command line options -a, -b and -c.
So much for specific examples. What’s the generic rule? I think it’s this: pointer-happy systems can’t be interactive. That’s because interactive programming is about saving your data objects. And this is only useful when the current value of a preserved object is clear to you. Otherwise, you can’t use the object, so what’s the point?
When you have pointers in your objects, the objects aren’t self-contained, and when the pointed objects are redefined, it isn’t clear what should happen with the pointing objects. Should they point to the new thing or the old thing? Either answer can be counter-intuitive to you, and the whole point of interactive programming is to let you enter a state of flow, and if you scratch your head and can’t easily guess what the old object means right now, you aren’t in a state of flow.
In particular, pointers to code are the worst kind of pointers, because code is the most intertwined data of your program, and a pointer to a single block of code basically points to the entire code base. When an object points to an old function, and the function was redefined, and the system keeps the old definition around, you may easily get a call sequence with both the new function and the old function, which is probably ridiculous. And if you make the old object point to the new function, the function might simply fail to work with that object, and you just can’t tell whether it will work or not without running it, remember?
For example, Python is a good interactive calculator, because arithmetic expressions are self-contained. Even if they do contain references to variables, it’s fairly clear what happens when you change a variable - all expressions mentioning it will now recompute differently. Note that arithmetic expressions aren’t Turing-complete and can’t have cyclic references. Now, if you use Python’s object-oriented features, then you have objects which point to their class definition which is a bunch of code pointers, and now when you reload the module defining the class, what good are your old objects?
This is why I don’t believe in Numeric Python. The whole point of using Python is to use its pointer-happy features, like classes and hairy data structures and dynamically defined functions and stuff. Numeric programming of the kind you do in Matlab tends to use flat, simple objects, which has the nice side-effect of making interactive programming work very well. If you use a numeric library inside a pointer-happy language like Python, quite soon the other libraries you use will make interactive programming annoying. So you’ll either move to batch programming or suffer in denial like the die-hard Python weenie you are. Someone using Matlab will be better off, since interactive programming is more productive than batch programming, when it works well.
So at the bottom line, I think that interactive programming has limited applicability, since “general-purpose” programming environments pretty much have to be pointer-happy. That is, if a language doesn’t make it very easy to create a huge intertwined mess of code and data pointers, I don’t see how it can be usable outside of a fairly restricted domain. And even in the “flat” environments like Matlab or Unix, while old data objects can be useful, old commands are, and ought to be, a two-edged sword. Because they are code, and code is never self-contained and thus has a great potential to do the wrong thing when applied in a new context.
This whole claim is one of those things I’m not quite sure about. From my experience, it took me quite some time to realize which interactive features help me and which get in the way with each environment I tried. So I can’t know what happens in Lisp or Smalltalk or Tcl or Excel or Emacs, in terms of (1) applicability to “general-purpose” tasks, (2) the amount of self-contained data compared to the kind with pointers, especially pointers to code and (3) the extent to which the thing is itchy and annoying at times. So comments are most welcome. In particular, if you know of an environment that, put simply, isn’t more itchy than Matlab but isn’t less general-purpose than Python, that would be very interesting.
February 12th, 2008 — hardware, software
There’s a very influential platform called the AVM, which stands for Algorithmic Virtual Machine. That’s the imaginary device people use as their mental model of a computer. In particular, it’s used by many people working on algorithms where performance matters. Performance matters in many different contexts, ranging from huge clusters processing astronomic amounts of data to modest applications running on pathetically weak hardware. However, I believe that the core architecture of the AVM is basically the same everywhere.
AVM application development is done using the ubiquitous AVM SDK - a whiteboard and a couple of hands for handwaving. An AVM application consists of a set of operations your algorithm needs executed. Each operation has a cost (typically one cycle, sometimes more). You can then estimate the run time of your algorithm by the clever technique of summing the cost of all operations.
These estimations are never close enough to the real run time. The definition of “close enough” varies; the quality of estimations, by and large, doesn’t. That is, I claim that your handwavy AVM-derived estimation will fail to meet your precision requirements no matter what those requirements are. Apparently our tolerance for errors grows with the lack of understanding of the problem, but it never grows enough. But I’m not really sure about this theory; I’m only sure about AVM-estimations-suck part. Here’s why.
The AVM is basically this imaginary machine that runs “operations”. Here are some things that real machines must do, but the AVM doesn’t:
- Fetching instructions
- Fetching operands
- Testing for conditions
- Storing results
Basically, the Algorithmic Virtual Machine developers concentrate on “operations” and ignore addressing, branches, caches, buses, registers, pipelines, and all those other gadgets which are needed in order to dispatch the operation. In fact, that’s how I currently distinguish between people who write software to get a job done and people who think of software as their job. “People who program” are into operations (algebra, networking, AI); “programmers” are into dispatching (programming languages, operating systems, OO). This is about mental focus rather than aptitude. I haven’t noticed that people of either group are inherently less productive than the other kind.
When they’re after performance, the “operations” people will naturally look for a way to reduce the number of operations. Sometimes, they’ll find an algorithm with a better asymptotic complexity - O(N+M) instead of O(N*M). At other times, they’ll come up with a way to perform 4*N*M operations instead of 16*N*M. Both results are very significant - if M and N are the only variables. The trouble is that you can’t see all the variables if you just look at the math (as in “we want to multiply and sum all these and then compare to that”). That way, you assume that you run on the AVM and leave out all the dispatching-related variables and get the wrong answer.
Is there a way to take the cost of dispatching into account? Not really, not without implementing your algorithm and measuring its performance. However, families of machines do have related sets of heuristics that can be used to guess the cost of running on them. For example, here are a couple of heuristics that I use for SIMD machines (they are relevant elsewhere, but their relative importance may drop):
- Bandwidth is costly.
- Addressing is costly.
These heuristics are vague, and I don’t see a very good way to make them formal. Perhaps there isn’t any. To show that my points have any formal significance, I’d have to formally prove that there’s unavoidable intrinsic cost to some things no matter how you build your hardware. And I don’t know how to go about that. So what I’ll do is I’ll give some examples to show what I mean, and leave it there.
Bandwidth
Consider two “algorithms” (probably too fancy a name in this context): computing dot product, and computing its partial sums (Matlab: sum(a .* b) and cumsum(a .* b)). Exactly the same amount of “operations” - N multiplications and N additions. Many people with BA, MSc and PhD degrees in CS assume that the run time is going to be the same, too. It won’t, because sum only produces one output, and cumsum produces N outputs. Worse, if the input vector elements are 8-bit integers, we probably need at least 32 bits for each output element. So we generate N*4 bytes of output from N*2 input bytes.
At this point, some people will say “Yeah, memory. Processors are fast, memories are slow, sure, memory is a problem”. But it isn’t just about the memory; memory bandwidth is just one kind of bandwidth. Let’s look at the non-memory problems of the partial-sums-of-dot-product algorithm. On the way, I’ll try to show how the “bandwidth costs” heuristic can be used to guess what your hardware can do and what the performance will be.
Consider a machine with a SIMD instruction set. Most likely, the machine has registers of fixed width (say, 16 bytes), and each instruction gets 2 inputs and produces 1 output. Why? Well, the hardware ought to support 2 inputs and 1 output to do basic math. Now, if it also wants to have an instruction that produces, say, 4 outputs, then it needs to have 3 additional output buses from the data processing units to the register file. It also needs a multiplexer so that each of the 4 outputs can be routed to each of its N registers (N can be 16 or 32 or even 128). The cost of multiplexers is, roughly, O(M*N), where M is the number of inputs and N is the number of outputs. That’s awfully costly. Bandwidth costs. So they probably use 2 inputs and 1 output everywhere.
Now, suppose the machine has 16 multipliers, which is quite likely - 1 multiplier for each register byte, so we can multiply 16 pairs of bytes simultaneously. Does this mean that we can then take those 16 products and compute 16 new partial sums, all in the same cycle? Nope, because, among other things, we’d need a command producing 16×4 bytes to do that, and that’s too much bandwidth. Are we likely to have a command that updates less than 16 accumulators? Yes, because that would speed up dot products, and dot products are very important; let’s look at the manual.
You’re likely to find a command updating - guess how many? - 4 accumulators (32 bits times 4 equals 16 bytes, that’s exactly one machine register). If the register size is 8 bytes, you’ll probably get a command updating 2 accumulators, and so on. Sometimes the machine uses “register pairs” for output; that doubles the register size for output bandwidth calculation purposes. The bottom line is that instruction set extensions can speed up dot product to an extent impossible for its partial sums. You might have noticed another problem here, that of the dependency of a partial sum on the previous partial sum. Removing this dependency doesn’t solve the bandwidth problem. For example, consider the vertical projection of point-wise multiplication of 2 8-bit images, which has the same not-enough-accumulators problem.
There is little you can do about the bandwidth problem in the partial sums case - the algorithm is I/O bound. Some algorithms aren’t, so you can optimize them to minimize the cost of bandwidth. For example, matrix multiplication is essentially lots of dot products. If you do those dot products straightforwardly, you’ll have a loop spending 2 commands for loading the matrix elements into registers, and one command for multiplying and accumulating (MAC). 2 loads per MAC means an overhead of 200%.
However, you can work on blocks - 4 rows of matrix A and 4 columns of matrix B, and compute the 4×4=16 dot products in your loop. That’s 4+4=8 loads per 16 MACs; the overhead dropped to 50%. If you have enough registers to do this. And it’s still quite impressive overhead, isn’t it? Your typical AVM user would be very disappointed. (Yes, some machines can parallelize the loads and the MACs, but some can’t, and it’s a toy example, and stop nitpicking). BTW, blocking can be used to save loads from main memory to cache just like we’ve used it to save loads from cache to registers.
OK. With partial sums of dot product, the bandwidth problem kills performance, and with matrix multiplication, it doesn’t. What about convolution, which is about as basic as our previous examples? Gee, I really don’t know. It’s tricky, because with convolution, you need to store intermediate results somewhere, and it’s unclear how many of them you’re going to need. The optimal implementation depends on the quirks of the data processing units, the I/O, and the filter size. If you come across a benchmark showing the performance of convolution on some machine, you’ll probably find interesting variations caused by the filter size.
So we have a bread-and-butter algorithm, and non-trivial & non-portable performance characteristics. I think it’s one indication that your own less straightforward algorithm will also perform somewhat unpredictably. Unless you know an exact reason for the opposite.
Addressing
Bandwidth is one problem with fetching operands and storing results. Another problem is figuring out where they go. In the case of registers, we have costly multiplexers for selecting the source and destination registers of instructions. In the case of memory, we have addresses. Computing addresses has a cost. Reading data from those addresses also has a cost. Some address sequences are costlier than others from one of these perspectives, or both.
The dumbest example is the misalignment problem. People who learned C on x86 are sometimes annoyed when they meet a PowerPC or an ARM or almost any other processor since it won’t read a 32-bit integer from a misaligned address. So when you read a binary buffer from a file or a socket, you can’t just cast the char* to an int* and expect it to work. Isn’t it nice of x86 to properly handle these cases?
Maybe it’s nice, maybe it isn’t (at least if it failed, the code would be fixed to become legal C), but it sure is costly. The fact that it’s “in the hardware” doesn’t make it a single-cycle operation. If your address is misaligned, the 32 bits may reside in two different memory words (no matter what the word size is). The hardware will have to read the low word, and then read the high word, and then take the high bits of the low word and the low bits of the high word and make a single 32-bit value out of them. Because in one cycle, memories can only fetch one word from an aligned address.
Does it matter outside of I/O-related code using illegal pointer-casting? Consider the prosaic algorithm of computing the first derivative of a vector, spelled v(2:end)-v(1:end-1) in Matlab. If we run on a SIMD machine, we could execute several subtractions simultaneously. In order to do that, we need to fetch a word containing v[0]…v[15] and a word containing v[1]…v[16] (both zero-based). But the second word is misaligned. The handling of misalignment will have a cost, whether it’s done in hardware or in software.
Well, at least the operands of subtraction live in subsequent addresses - 0,1,2…15 and 1,2,3…16. That’s how data processing units like them: you read a pack of numbers from memory and feed them right into the array of adders, ready to crunch them. It’s not always like that. Consider scaling: a(x) = b(s*x+t). This can be used to resize images (handy), or to play records at a different speed the way you’d do with a tape recorder (less handy, unless you like squeaky or growly voices).
Now, if s isn’t integral (say, s=0.6), you’d have to fetch data from places such as s*x+t = 1.3, 1.9, 2.5, 3.1, 3.7... Suppose you want to use linear interpolation to approximate a(1.3) as a(1)*0.7+a(2)*0.3. So now we need to multiply the vector of “low” elements - a([1,1,2...]) - by the vector of weights - [0.7,0.1,0.5...] - and add the result to the similar product a([2,2,3...])*[0.3,0.9,0.5...]. The multiplications and the additions map nicely to SIMD instruction sets; the indexing doesn’t, because you have those weird jumpy indexes. So this time, the addressing can become a real bottleneck because it can prevent you from using SIMD instructions altogether and serialize your entire computation.
Well, at least we access adjacent elements. This means that most memory accesses will hit the cache. When you bump into an element that isn’t cached yet, the machine will bring a whole cache line (say, 32 bytes), and then you’ll read the other elements in that cache line, so it will pay off. You can even issue cache prefetching instructions so that while you’re working on the current cache line, the machine will read the next one in the background. That way, you’ll hit the cache all the time, instead of having your processor repeatedly surprised (hey, I don’t have a(32) in the cache!.. hey, I don’t have a(64) in the cache!.. hey, I don’t have…). Avoiding the regularly scheduled surprise can be really beneficial, although cache prefetching is truly disgusting (it’s basically a very finicky kind of cooperative multi-tasking - you ought to stuff the prefetching commands into the exactly right spots in your code).
Now, consider a(x) = b(f(x)) - a generic transformation of an input vector given a function for computing the input coordinate from the output coordinate. We have no idea what the next address is going to be, do we? If the transformation is complicated enough, we’re going to miss the cache a lot. By the way, if the transformation is in fact simple, and the compiler knows the transformation at compile time, the compiler is still very unlikely to generate optimal cache prefetching commands. Which is one of the gazillion differences between C++ templates and “machine-optimal” code.
DVMs and TVMs
My bandwidth and addressing heuristics don’t model a real machine; they only model an upgrade to the AVM for SIMD machines. Multi-box computing is one example of an entire universe of considerations they fail to model. So what we got is a DVM - Domain-specific Virtual Machine.
Now, in order to estimate performance without measuring (which is necessary when you choose your optimizations - you just can’t try all the different options), I recommend a TVM (Target-specific Virtual Machine). You get one as follows. You start with the AVM. This gives overly optimistic performance estimations. You then add the features needed to get a DVM. This gives overly pessimistic estimations.
Then, you ask some low-level-loving person: “What are the coolest features of this machine that other machines don’t have?” This will give you the capabilities that the real processor has but its DVM doesn’t have. For example, PowerPC with AltiVec extensions is basically a standard SIMD DVM plus vec_perm. I won’t talk about vec_perm very much, but if you ever need to optimize for AltiVec, this is the one instruction you want to remember. It solves the indexing problem in the scaling example above, among other things. Using a SIMD DVM and forgetting about vec_perm would make AltiVec look worse than it really is, and some algorithms much more costly than they really are.
And this is how you get a TVM for your platform. The resulting mental model gives you a fairly realistic picture, second only to reading the entire manual and understanding the interactions of all the features (not that easy). And it definitely beats the AVM by… how do you estimate the quality of handwaving? OK, it beats the AVM by the factor of 5, on average. What, you want a proof? Just watch the hands go.
February 2nd, 2008 — hardware, software
This is a follow-up on the previous entry, the “high-level CPU” challenge. I’ll try to summarize the replies and my opinion on the various proposals. But first, a summary of my original points:
- “Very” high-level languages have a cost. Attributing this cost to the underlying hardware architecture is wrong. You could move the cost from software to hardware, but that wouldn’t eliminate it. I primarily referred to languages characterized by indirection levels and late binding of user-defined operations, such as Lisp and Python, and to a lesser extent/confidence to side-effect-free languages like Haskell. I didn’t mean to say that high-level languages should not be used, in fact I think that their cost is wildly overestimated by many. However, denying the existence of any intrinsic cost guarantees that people will keep overestimating it, because if it weren’t that high a cost, why would you lie to them? I mean it very seriously; horrible tech marketing is responsible for the death (or coma) of many great things.
- Of all systems with similar cost and features, the one that has the least stuff implemented in hardware is the best, because you can change more things. The idea that moving things to hardware is a sure way to make them efficient is a misconception. Hardware can’t do “anything in one cycle”; there are many constraints involved. Therefore, it’s better to let the software explicitly control a set of low-level components than build hardware logic implementing high-level interfaces to them. For example, to add 2 numbers on a RISC machine, you load them to registers, then add. You could have a command adding operands from memory; it wouldn’t run faster, because the hardware would have to spend cycles on loading operands to (implicit) registers. Hardware doesn’t have to be a RISC machine, but it’s always better to move as much control to software as possible under the given system cost constraints.
I basically asked people to refute point 1 (”HLLs are costly”). What follows describes the attempts people made at it.
Computers you can’t program

Several readers managed to ignore my references to specific high-level languages and used the opportunity to pimp hardware architectures that can’t run those languages. Or any other programming languages designed for human beings, for that matter. Example architectures:
It is my opinion that the fans of this family of hardware/vaporware, consistent advocates of The New Age of Computing, have serious AI problems. Here’s a sample quote on cellular automata: “I guess they really are like us.” Well, if you want to build a computing device in order to have a relationship with it, maybe a cellular automaton will do the trick. Although I’d recommend to first check the fine selection of Homo Sapiens we have here on Planet Earth. Because those come with lots of features you’d like in a friend, a foe, a spouse or an employee already built-in, while computer hardware has a certain gap to fill in this department.

Me, I want to build machines to do stuff that someone “like us” wouldn’t want to do, for any of the several reasons (the job is hard/boring/stinky/whatever). And once I’ve built them, I want people to be able to use them. Please note this last point. People and other “nature’s computers”, like animals and fungi, aren’t supposed to be “used”. In fact, all those systems spend a huge amount of resources to avoid being used. Machines aren’t supposed to be like that. Machines are supposed to do what you want. Which means that both the designer and the user need to control them. Now, a computer that can’t even be tricked into parsing HTML in a straightforward way doesn’t look like it’s built to be controlled, does it?
Let me supply you with an example: Prolog. Prolog is an order of magnitude more tame than a neural net (and two orders of magnitude compared to a cellular automaton) when it comes to “control” - you can implement HTML parsing with it. But Prolog does show alarming signs of independence - it spends most of its time in its inference engine, an elaborate mechanism running lengthy non-trivial loops, which sometimes turn out to be infinite. You aren’t supposed to single-step those loops; you’re supposed to specify truths about your world, and Prolog will derive more truths for you. Prolog was supposed to be the wave of the future about 25 years ago. I think it can be safely called dead by now, despite the fair amount of money poured into it. I think it died because it’s extremely frustrating to use - you just can’t tell why the hell it worked that way in each particular case. I’ve never seen anything remotely as annoying as Prolog, with the notable exception of Makefiles, running on top of a wonderful inference engine of their own.
My current opinion is that neural networks rarely deserve a special hardware implementation - if you need them, build a more traditional computer and run them on top of that; and cellular automata are just stillborn. I might be wrong in the sense that a hardware implementation of these models is the optimal solution for some problem, hence we’ll see those beasts in some corner of a successful real-world system. But the vast majority of computing, including AI apps, will run on machines that support basic bread-and-butter programmer things simply and straightforwardly. Here’s a Computing Technology Acceptance Lower Bound for ya: if you can’t parse a frigging log file with it, you can’t do anything with it.
Self-assembly computers
Our next contestant is a machine that you surely can program, once you’ve built it from the pieces which came in the box. Some people mentioned “FPGA”, others failed to call it by its name (one comment mentioned a “giant hypercube of gates”, for example). In this part, I’m talking about the suggestions to use an FPGA without further advice on exactly how it should be used; that is, FPGA as the architecture, not FPGA used to prototype an architecture.
Maybe people think that with an FPGA, “everything is possible”, so in particular, you could easily build a processor efficiently implementing a HLL. Well, FPGA is just a way to implement hardware allowing you to trade NRE for unit cost. And with hardware, some things are possible and some aren’t, or so I claim - for example, you can’t magically make the cost of HLLs go away. If you can’t think of a way to reduce the overhead HLLs impose on the system cost, citing FPGA doesn’t make your argument look any better. On the contrary - you’ve saved NRE, but you’ve raised the cost of the hardware by the factor of 5.
Another angle: can you build a compiler? Probably so. Would you like to start your project with building a compiler? Probably not. Now, what makes people think that they want to build hardware themselves? I really don’t know. Building hardware is gnarly, FPGA or not - there are lots of constraints you have to think about to make the thing efficient, and it’s extremely easy to err on the side of not having enough flexibility. The latter typically happens because you try to implement overly high-level interfaces; it then turns out that you need the same low-level components to do something slightly different.
And changing hardware isn’t quite as easy as changing software, even with FPGA, because hardware description code, with its massive parallelism and underlying synthesis constraints, is fairly tricky. FPGA is a perfectly legitimate platform for hardware vendors, but an awful interface for application programmers. If you deliver FPGAs, make it your implementation detail; giving it to application programmers isn’t very likely to make them happy in the long run.
At the other end of the spectrum, there’s the kind of “self-assembly computer” that reassembles itself automatically, “adapting to the user’s needs”. Even if it made any sense (and it doesn’t), it still wouldn’t answer the question: how should this magical hardware adapt to handle HLLs, for example, indirect memory access?
Actual computers designed to run HLLs
Some people mentioned actual hardware which was built to run HLLs, including Reduceron, Tcl on Board, Lisp Machines, Rekursiv, and ARM’s Jazelle instruction set. For some reason, nobody mentioned Intel’s 432, an object-oriented microprocessor which was supposed to replace x86, but was, among other things, too slow. This illustrates that the existence of a “high-level processor” doesn’t mean that it was a good idea (of course it doesn’t mean the opposite, either).
I’ll now talk about these machines in increasing order of my confidence that the architecture doesn’t remove the overhead posed by the HLL it’s supposed to run.
- Reduceron is designed to run Haskell, and focuses on an optimization problem I wasn’t even aware of, that of graph reduction. One of the primary ideas seem to be that graph reduction doesn’t suffer from dependency problems which could inhibit parallelization, but still can’t be parallelized on stock CPUs. That’s because a lot of memory access is involved, and there’s typically little load/store bandwidth available to a CPU compared to its data processing capability. Well, I agree with this completely in the sense that memory access is the number one area where custom hardware design can help; more on that later. However, I’m not sure that the right way to go about it is to build a “Haskell Machine”; building a lower-level processor with lots of bandwidth available to it could be better. Then again, it could be worse, and my confidence level in this area is extremely low, which is why I list the Reduceron before the others: I think I’ll look into this whole business some more. Pure functional languages are a weak spot of mine; for now, I can only say three things for sure: (1) side effects are a huge source of bugs, (2) although they get in the way of optimizers, side effects are a poor man’s number one source of optimizations, so living without them isn’t easy, and (3) the Reduceron is a pretty cool project.
- Tcl on Board was built to run a Tcl dialect. Tcl doesn’t pose optimization problems that languages like Lisp or Python do - it’s largely a procedural language grinding flat objects. And there’s another thing I ought to tell you: I don’t like Tcl. However, I think that this Tcl chip is kind of insightful, because it’s designed for low-end applications. And the single biggest win of having a “high-level” instruction set is to save space on program encoding. Several people mentioned it as a big deal; I don’t think of it as a big deal, because instruction caches always worked great for me (~90% hits without any particular optimizations). However, for really small systems of the low-end embedded kind, program encoding is a real issue. I’m not saying that Tcl on Board is a good (or a bad) idea by itself; I know nothing about these things. I’m just saying that while I think high-level hardware will fail to deliver speed gains, it might give you space gains, so it may be the way to go for really small systems which aren’t supposed to scale. Not that I know much about those systems, except that if I’d have to build one, I’d seriously consider Forth…
- Lisp Machines ran Lisp, and Rekursiv ran LINGO, which apparently was somewhat similar to Smalltalk. This I know. What I don’t know is how the hardware support for the high-level features would eliminate the cost overhead of the HLLs involved; that’s because I don’t know the architecture, and nobody gave much detail. I don’t see a way to solve the fundamental problems. I mean, if I want to support arrays of bytes, then each byte must be tagged, doesn’t it? And if I only support fixnums larger than bytes, then I’d waste space, right? And just what could the LispM do about the hairy binding done by CLOS behind the scenes? Again, this doesn’t mean these machines weren’t a good idea; in fact I wish my desktop hardware were more expensive and more secure, and tagged architectures could help. All I’m saying is that it would be more expensive. I think. I’d like to hear more about LispM, simply because most people who used it seem to be very fond of it - I know just one exception.
- Jazelle is supposed to run Java. Java is significantly lower-level than Lisp or Smalltalk. It still is a beautiful example, because the hardware support in this case yields little performance benefits. In fact MIPS reported that a software implementation of JVM running on a MIPS core outperformed a JVM using Jazelle by a factor of about 2. I’ve never seen a refutation of that.
Stock computers with bells and whistles
Finally, there was a bunch of suggestions to add specific features to traditional processors.
- Content-addressable memory is supposed to speed up associative array look-ups. There’s a well-known aphorism by Alan Perlis - “A language that doesn’t affect the way you think about programming is not worth knowing”. Here’s my attempt at an aphorism: “A processor that doesn’t affect the way you access memory is not worth building”. This makes the wide variety of tools designed to help you build a SIMD VLIW machine with your own data processing instructions uninteresting to me, and on the other hand, makes CAM quite appealing. I came to believe that your biggest problem isn’t processing the data, it’s fetching the data. I might talk about it some time; the Reduceron, essentially designed to solve a memory access problem preventing the optimization of a “perfectly parallelizable” algorithm, is one example of this. However, CAM goes way beyond providing more bandwidth or helping with the addressing - it adds comparison logic to each memory word. While it sounds impractical to replace all of your RAM with CAM, stashing a CAM array somewhere inside your system could help with some problems. Then again, it won’t necessarily pay off - it depends on the exact details of what you’re doing. All I can say at this point is that it’s a Worthy Idea, which, for some reason, I keep forgetting about, and I shouldn’t.
- GC/reference counting optimizations. Maybe I’m wildly wrong, but I don’t think the garbage is a big deal, ’cause how much time do you spend on garbage collection compared to plain malloc/free? The way I see it, the problem isn’t so much with the overhead of garbage collection as it is with the amount of small objects allocated by the system and, most importantly, the amount of indirect memory accesses. I learned that some Lisp compilers can do object inlining with varying amounts of user intervention; well, when it works out, it removes the need for special hardware support. The thing is, I think the main battle here is to flatten objects, not to efficiently get rid of them. And I think that it’s quite clearly software that should fight that battle.
- Regular expression and string functions in hardware: I don’t think it’s worth the trouble, because how much time do you spend in regex matching anyway? Maybe it’s because I don’t process massive volumes of text, but when I do process the moderate amounts of text I bump into, there’s the part where you store your findings in data structures, and I think it might be the bottleneck. And then a huge amount of data comes from places like RDBMSes where you don’t have to parse much. You’d end up with idle silicon, quietly leaking power.
The good stuff
At the bottom line, there were two hardware-related things which captured my intoxicated imagination: the Reduceron and content-addressable memories. If anything ever materializes around this, I’ll send out some samples. In the meanwhile - thanks!
January 29th, 2008 — hardware, software
Do you love (”very”) high-level languages? Like Lisp, Smalltalk, Python, Ruby? Or maybe Haskell, ML? I love high-level languages.
Do you think high-level languages would run fast if the stock hardware weren’t “brain-damaged”/”built to run C”/”a von Neumann machine (instead of some other wonderful thing)”? You do think so? I have a challenge for you. I bet you’ll be interested.
Background:
- I work on the definition of custom instruction set processors (just finished one).
- It’s fairly high-end stuff (MHz/transistor count in the hundreds of millions).
- I also work on the related programming languages (compilers, etc.).
- Whenever application programmers have to deal with low-level issues of the machine I’m (partly) responsible for, I feel genuine shame. They should be doing their job; the machine details are my job. Feels like failure (even if “the state of the art” isn’t any better).
- …But, I’m also obsessed with performance. Because the apps which run on top of my stuff are ever-hungry, number-crunching real time monsters. Online computer vision. Loads of fun, and loads of processing that would make a “classic” DSP hacker’s eyeballs pop out of his skull.
My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don’t you actually tell us about it, instead of briefly mentioning it? Requirement: your architecture should allow to run HLL code much faster than a compiler emitting something like RISC instructions, without significant physical size penalties. In other words, if I have so many square millimeters of silicon, and I pad it with your cores instead of, say, MIPS cores, I’ll be able to implement my apps in a much more high-level fashion without losing much performance (25% sounds like a reasonable upper bound). Bonus points for intrinsic support for vectorized low-precision computations.
If your architecture meets these requirements, I’ll consider a physical implementation very seriously (because we could use that kind of thing), and if it works out, you’ll get a chip so you can show people what your ideas look like. I can’t promise anything, because, as usual, there are more forces at play than the theoretical technical virtue of an idea. I can only promise to publicly announce that your idea was awesome and I’d love to implement it; not much, but it’s the best I can deliver.
If you can’t think of anything, then your consistent assertions about “stupid hardware” are a stupid bluff. Do us a favor and shut up. WARNING: I can’t do hardware myself, but there are lots of brilliant hardware hackers around me, and I’ve seen how actual chips are made and what your constraints are. Don’t bullshit me, buddy.
Seriously, I’m sick and tired of HLL weenie trash talk. Especially when it comes from apparently credible and exceedingly competent people.
Alan Kay, the inventor of Smalltalk: “Just as an aside, to give you an interesting benchmark—on roughly the same system, roughly optimized the same way, a benchmark from 1979 at Xerox PARC runs only 50 times faster today. Moore’s law has given us somewhere between 40,000 and 60,000 times improvement in that time. So there’s approximately a factor of 1,000 in efficiency that has been lost by bad CPU architectures.” … “We’re not going to worry about whether we can compile it into a von Neumann computer or not, and we will make the microcode do whatever we need to get around these inefficiencies because a lot of the inefficiencies are just putting stuff on obsolete hardware architectures.”
Jamie Zawinski, an author of Mozilla: “In a large application, a good garbage collector is more efficient than malloc/free.” … “Don’t blame the concept of GC just because you’ve never seen a good GC that interfaces well with your favorite language.” Elsewhere: “it’s a misconception that lisp is, by its nature, slow, or even slower than C” … “if you’re doing a *big* project in C or C++, well, you’re going to end up reinventing most of the lisp runtime anyway”
Steve Yegge, a great tech blogger: “The von Neumann machine is a convenient, cost-effective, 1950s realization of a Turing Machine, which is a famous abstract model for performing computations.” … “There are various other kinds of computers, such as convenient realizations of neural networks or cellular automata, but they’re nowhere as popular either, at least not yet”. And… “The Von Neumann architecture is not the only one out there, nor is it going to last much longer (in the grand 400-year scheme of things.)”
Wow. Sounds dazzling and mind-opening, doesn’t it? Except there isn’t any technical detail whatsoever. I mean, it’s important to be open-minded and stuff. It really is. The fact that something doesn’t seem “practical” doesn’t mean you shouldn’t think or talk about it. But if something isn’t even a something, just a vague idea about Awesome Coolness, it poisons the readers’ minds, people. It’s like talking about Spirituality of the kind that lets you jump over cliffs at your mighty will or something (I’m not that good at New Age, but I think they have things like these in stock). This can only lead to three results:
- Your reader ignores you.
- Your reader sits on a couch and waits to gain enough Spirituality to jump around cliffs. Congratulations! Your writing has got you one fat fanboy.
- Your reader assumes he’s Spiritual enough already and jumps off a cliff, so you’ve got a slim fanboy corpse.
It’s the same with this Great High-Level Hardware talk. I can ignore it, or I can wait forever until it emerges, or I can miserably fail trying to do it myself. Seriously, let’s look at these claims a little closer.
Alan Kay mentions a benchmark showing how lame our CPUs are. I’d really like to see that benchmark. Because I’ve checked out the B5000 which he praised in that article. And I don’t think a modern implementation of that architecture would beat a modern CPU in terms of raw efficiency. You see, RISC happened for a reason. Very roughly, it’s like this:
- You can access memories at single cycle throughput.
- You can process operands in registers at single cycle throughput.
- And that’s pretty much what you can do.
Suppose you want to support strings and have a string comparison instruction. You might think that “it’s done in the hardware”, so it’s blindingly fast. It isn’t, because the hardware still has to access memory, one word per cycle. A superscalar/VLIW assembly loop would run just as quickly; the only thing you’d save is a few bytes for instruction encoding. On the other hand, your string comparison thingie has got you into several sorts of trouble:
- Your machine is larger, with little gain - you don’t compare strings most of the time.
- Your machine is complicated, so optimizing the hardware is trickier.
- Compilers have trouble actually utilizing your instructions.
- Especially as the underlying hardware implementation grows more complicated and the performance of assembly code gets harder to model.
When people were supposed to write assembly programs, the inclusion of complicated high-level instructions was somewhat natural. When it became clear that compilers write most of the programs (because compilation became cheap enough), processors became less high-level; the points above hopefully explain why.
And don’t get me started about the tagging of data words. B5000 had polymorphic microcode - it would load two words, look at their type bits and add them according to the run time types. Well, B5000 didn’t support things like unsigned 8-bit integers, which happen to be something I need, because that’s how you store images, for example. Am I supposed to carry tag bits in each frigging byte? Let me point out that it has its cost. And I don’t think this sort of low-level polymorphism dwarfs the cost of Lisp or Smalltalk-style dynamic binding, either (B5000 was designed to run Algol; what would you do to run Smalltalk?)
There’s another angle to it: Alan Kay mentions that you almost couldn’t crash the B5000, which suited the business apps it was supposed to run quite well. I think that’s just awesome, I really do (I shoveled through lots of core dumps). In fact, I think people who implemented modern desktop operating systems and web browsers in unsafe languages on top of unsafe hardware are directly responsible for the vast majority of actual security problems out there. But (1) in many systems, the performance is really really important and (2) I think that security in software, the way it’s done in JVM or .NET, still has lower overall cost than tagging every byte (I’m less sure about part 2 because I don’t really know the guts of those VMs). Anyway, I think that hardware-enforced safety is costly, and you ought to acknowledge it (or really show why this fairly intuitive assumption is wrong, that is, delve into the details).
JWZ’s Lisp-can-be-efficient-on-stock-hardware claim isn’t much better than Smalltalk-can-be-efficient-on-custom-hardware, I find. Just how can it be? If you use Lisp’s static annotation system, your code becomes uglier than Java, and much less safe (I don’t think Lisp does static checking of parameter types, it just goes ahead and passes you an object and lets you think it’s an integer). If you use Lisp in the Lispy way that makes it so attractive in the first place, how on Earth can you optimize out the dynamic type checks and binding? You’d have to solve undecidable problems to make sense of the data flow. “A large project in C would implement the Lisp run time?” Oh really? You mean each variable will have the type LispObject (or PyObject or whatever)? Never happens, unless the C code is written by a deeply disturbed Lisp weenie (gcc and especially BetaPlayer, I’m talking about you). The fact that some people write C code as if they were a Lisp back-end is their personal problem, nothing more, nothing less.
The dynamic memory allocation business is no picnic, either. I won’t argue that garbage collection is significantly less efficient than manual malloc/free calls, because I’m not so sure about it. What I will argue is that a good Lisp program will use much more dynamic allocation and indirection levels than a good C program (again, I ignore the case of emulating C in Lisp, or Lisp in C, because I think it’s a waste of time anyway). And if you want to make your objects flat, I think you need a static type system, so you won’t be much higher-level than Java in terms of dynamic flexibility. And levels of indirection are extremely costly because every data-dependent memory access is awfully likely to introduce pipeline stalls.
Pure functional languages with static typing have their own problem - they lack side effects and make lots of copies at the interface level; eliminating those copies is left as an exercise to the compiler writer. I’ve never worked through a significant array of such exercises, so I won’t argue about the problems of that. I’ll just mention that static typing (irregardless of the type inference technique) characterizes lower-level languages, because now I have to think about types, just the way imperative programming is lower-level than functional programming, because now I have to think about the order of side effects. You can tell me that I don’t know what “high-level” means; I won’t care.
Now, the von Neumann machine business. Do you realize the extent to which memory arrays are optimized and standardized today? It’s nowhere near what happens with CPUs. There are lots of CPU families running lots of different instruction sets. All memories just load and store. Both static RAM (the expensive and fast kind) and dynamic RAM (the cheap and slower kind) are optimized to death, from raw performance to factory testing needed to detect manufacturing defects. You don’t think about memories when you design hardware, just the way you don’t think about the kind of floating point you want to use in your numeric app - you go for IEEE because so much intellectual effort was invested in it on all levels to make it work well.
But let’s go with the New Age flow of “von Neumann machine is a relic from the fifties”. What kinds of other architectures are there, and how do you program them, may I ask? “C is for von Neumann machines”. Well, so is Java and so is Lisp; all have contiguous arrays. Linked lists and dictionaries aren’t designed for any other kind of machine, either; in fact lots of standard big O complexity analysis assumes a von Neumann machine - O(1) random access.
And suppose you’re willing to drop standard memories and standard programming languages and standard complexity analysis. I don’t think you’re a crackpot, I really don’t; I think you’re bluffing, most probably, but you could be a brilliant and creative individual. I sincerely don’t think that anything practiced by millions can automatically be considered “true” or “right”; I was born in the Soviet Union, so I know all about it. Anyway, I want to hear your ideas. I have images. I must process those images and find stuff in them. I need to write a program and control its behavior. You know, the usual edit-run-debug-swear cycle. What model do you propose to use? Don’t just say “neural nets”. Let’s hear some details about hardware architecture.
I really want to know. I assume that an opinion held by quite some celebrities is shared by lots and lots of people out there. Many of you are competent programmers, some stronger than myself. Tell me why I’m wrong. I’ll send you a sample chip. I’ll publicly admit I was a smug misinformed dumbass. Whatever you like. I want to close this part of “efficient/high-level isn’t a trade-off” nonsense, so that I can go back to my scheduled ranting about the other part. You know, when I poke fun at C++ programmers who think STL is “high-level” (ha!). But until this “Lisp is efficient” (ha!) issue lingers, I just can’t go on ranting with clear conscience. Unbalanced ranting is evil, don’t you think?
January 26th, 2008 — software
Alan Kay on programming languages as user interfaces: “Even if you’re designing for professional programmers, in the end your programming language is basically a user-interface design. You will get much better results regardless of what you’re trying to do if you think of it as a user-interface design.”
Steve Yegge on the relative difficulty of high-level and low-level programming: “Observation: systems programmers look down on application programmers. Pop culture is that systems programming (kernels, drivers, real-time OSes, etc.) is harder, possibly since they’re equating app programming with just laying out the UI. I’ve done both — or all three, counting doing a bunch of tedious UI layouts in various frameworks. App-level programming is harder than systems programming.”
m4 manual page on Turing tar pit addictiveness: “Some people found m4 to be fairly addictive. They first use m4 for simple problems, then take bigger and bigger challenges, learning how to write complex m4 sets of macros along the way. Once really addicted, users pursue writing of sophisticated m4 applications even to solve simple problems, devoting more time debugging their m4 scripts than doing real work. Beware that m4 may be dangerous for the health of compulsive programmers.”