The nomadic programmer

There’s this broad metaphor I have, with no conclusions attached - just an attempt to describe dynamics. I’ve recently shared it with the commie ex-VP school teacher and he liked it, so I thought it could fit well with the other borderline stuff I host at Proper Fixation under the “wetware” category.

So. One recurring theme in the history of civilization is the conflict between nomadic and settled people. Nomads think that land is for feeding cattle and you move elsewhere once there’s nothing left to graze. Villagers figure that land is for growing food, so you settle on it and fertilize it and irrigate it and stuff. Initially, nomads typically dominate the landscape, periodically attacking the settled villagers and taking their crops. However, the settled people eventually accumulate enough surplus to support cities, nation states and standing armies, extending their control to more and more lands and eventually exterminating the nomadic lifestyle altogether.

The way I painted this picture, I tend to side with the hard-working settled folk, the nomads being the parasitic losers I’ve depicted, and I think most of us civilized humans share similar sentiments. However, in my metaphor I side with the nomadic programmer, at least to a large extent, and I do so because of the meaning my metaphor assigns to “land”.

The thing I find analogous to land in programming is problems, because that’s where programmers live. Programmers live on (in?) problems in the sense of dealing with broken things most of the time - once something starts working, you move on to something that doesn’t. In another sense, large problems or problem areas a programmer deals with define that programmer’s territory. The programmer is in immediate demand to the extent that solutions to “his” problems are in demand; problems feed programmers. Strong programmers seek, in one way or another, to expand their responsibility to encompass more problems, and to preserve their existing responsibilities. And so on.

Now if we restate the respective worldviews of nomads and settlers in the terms of this metaphor, we’ll get this. Nomads think that problems exist for solving them and you move elsewhere once there’s nothing left to graze. Settlers think that problems exist for growing them, so they settle on them and fertilize them and irrigate them and stuff.

And now you can see why I’m inclined to sympathize with the nomadic programmer. Two other things fueling this sympathy are issues of personality to be discussed soon, and the fate of the nomad to be discussed immediately. And while the nomad is no longer the parasite, rest assured that he’s still, in the long run, the loser.

Initially - in a young and small organization - nomadic programmers tend to dominate the landscape. There are more problems than people around. The nomadic programmer travels from one urgent problem to another, grazing through them as fast as he can. Occasionally he stumbles upon a settler who has settled on a problem near the nomad’s territory and grown crops of code there. Well, if the problem occupied by the settler becomes urgent, or if the crops stand in the way of solving the nomad’s adjacent urgent problem, the nomad will go ahead and brutally solve the settler’s problem, wiping out his crops. The politics of the invasion will be trivial - a promise to deliver by the nomad carries lots of weight at this stage and the settler will not issue a counter-promise (to deliver in his own way) because he’s a peaceful code-growing villager who isn’t into stress which necessarily comes with delivering quickly.

However, the time goes by and sure enough, the settled people accumulate quite some surplus. What you grow on land is surplus wheat; what you grow on problems is surplus code. Code that wouldn’t naturally grow on a problem - but now that the problem was fertilized by the original settlers, they’ve grown enough code on it to support whole cities, a nation state, and a standing army of programmers, all making a living by fiddling with this code.

The nomad starts running out of pasture. Sure enough, there are lots of problems just like there used to be. But you can no longer solve them because (1) now it’s the majority and not a minority of problems that are already owned by someone (growing them rather than solving them) and (2) in most cases invasion is no longer an option. Now that the problem is owned by a nation state, responsible for lots of code and with lots of people working on that code, the nomad’s promise to deliver quickly carries very little weight compared to the danger of irritating the sovereign. While it is quite likely still true that a nomad will probably deliver more quickly than the whole nation state team, the nomad will not be able to take over the entire responsibility of the team. (It is possible that the single reason for the latter is the problems grown by the team itself and that a few nomads could in fact handle the original problem. But it is irrelevant since problems that could have been avoided are no less real than problems that couldn’t.)

So if the organization, by some decision making mechanism, lets the nomad invade the territory of the settled team and solve the stupid problem, and then the offended team, by some decision making mechanism, fights back by effectively going on strike, there is nothing the nomad will be able to offer the organization at this point. Of course it doesn’t have to come to this, just like political conflicts don’t have to come to full-scale wars, or personal conflicts to fist fights or court hearings. It’s enough for the worst case scenario to be likely to work out in favor of A rather than B to shift the balance decisively in favor of A. Even if neither A nor B nor anyone making decisions affecting A and B actually think in terms of this scenario, things tend to evolve and adapt such that decisions are made in favor of A. And in our case, the nomadic programmer is B.

Solving problems just isn’t the big thing in this organization anymore, just like the quality of life experienced by the inhabitants of some territory isn’t the main theme in international politics. Perhaps there are ways to improve the quality of life in Siberia, however this is not nearly as important politically as the fact that there’s already a guy exclusively responsible for the quality of life in Siberia. Perhaps Socialism with Chinese Characteristics could yield improvements in the lives of Siberians that Managed Democracy could not, however, if the Chinese try to act on this assumption, there will be a nuclear war. If what remains of a nomadic tribe somewhere in the region makes a similar attempt, then it will remain no more.

The disgruntled nomadic programmer reduces his ambition to merely being left alone to wander the remaining wilderness. However, this option is no more real for him now than the option of being left alone was available to the settler in the old days. Back then, the settlers were never safe since a nomad could always bump into them in an attempt to solve a related problem, and if their stuff got in the way, he’d rewrite or delete/disable their stuff. Now it is the nomad who is never safe since the nation states keep expanding their responsibilities into neighboring problems - having enough people to have some of them free for that some of the time.

(Actually having even partially idle workers on a team leaves few satisfying alternatives to an attempt at expanding the team’s responsibilities since other teams are always happy to seize an idle worker. Likewise, back in the old days the nomadic programmer had few satisfying alternatives to invading and solving others’ problems since otherwise he couldn’t keep his promises to deliver. It’s not (just) the intentions that fuel wars, it’s (also) the situation.)

The nation states seeking to expand won’t fight each other since the nomad is a much easier target, not having resources (time and reports) to look over his entire territory. Once a nation state team managed to take over some of that ever-shrinking territory, the nomad will never gain it back. Increasingly, the nomad has to reach compromises with neighboring nation states whenever his work is related to their work. Then it turns out that in order to be able to work on what he wants at all, he has to do it the way a chief commander or an officer of a nation state team wants him to do it - and then that in order to work on anything at all, he has to report to such a manager.

At this point the nomadic programmer can use his reputation and seniority to get pseudo-promoted to a non-productive position. Alternatively, he can actually become a report of a nation state team manager with whom the relationship is likely already strained - and his seniority, reputation and ambitions won’t make the transition into this particular position of a report any smoother. Alternatively he can quit. His failure is now complete.

(It may sound like a natural thing for a nomad to change jobs fairly frequently - part of a lifestyle rather than the failure of that lifestyle. However, nomadic programmers are those who like to travel from problem to problem - not from job to job; some like the latter but some don’t. A new job at a new place means a temporary, but possibly significant loss of confidence and efficiency. An African nomad won’t necessarily welcome a relocation to Alaska.)

As I’ve said above, I have reasons having to do with my personality to side with the nomadic programmer, especially at the stage of mounting pressure from nation state teams. The people I tend to relate to most easily seem to be those who prefer freedom to power. A talented freedom-seeker with a strong sense of responsibility will accumulate, well, responsibilities much more quickly than reports - a lot of territory to wander, and no standing army to protect it. (The problem with reports is that you take their freedom by telling them what to do and they take your freedom through your responsibility for their actions; who wants reports?) Since many freedom-lovers disdain politics, they won’t respect international borders - a problem should be solved, dammit; hence they’re likely to initiate invasions.

However, while this means that I personally will tend to find myself sympathizing with particular nomadic programmers, this does not mean that theirs is the right way or something. For example, it is unclear which share of programming problems out there can really be “solved” - grazed through and left alone - and which problems actually require continuous care and gardening that a true nomad is not likely to supply. Also, whether there’s a “solution” you only need to “maintain” or an “infrastructure” you want to “extend”, the code needs a permanent owner. I don’t believe in collective code ownership any more than in collective ownership of anything else - what it usually means is that everybody collectively fights over something. Therefore I think that ownership should generally be respected, and so a compromise which is, from a technical viewpoint, quite moronic, can otherwise be a great thing - a belief outside the nomad’s way.

So while I know where my sympathies lie, I don’t know which camp I’m in and this is why this metaphor doesn’t come with any conclusions, just the dynamics. In fact I’d rather leave it without conclusions but I wouldn’t mind expanding more on the dynamics. For example, some - but not all - settled civilizations were actually started by nomads enslaving argicultural villagers and settling among them. Apparently a similar distinction can be made between nation state teams of programmers; it is then interesting whether differences in their behavior can be traced to their different origins. Perhaps a person more entertained than appalled by the sort of perspective on the adventurous lives of programmers here presented is also the kind of person more entertained than appaled by the history of mankind in general and so could help develop this line of thought based on his knowledge of history. Could be fun.

Update (2009-08-18) - Chuck Moore: “I’ve met too many people who want to make a career out of a project instead of completing it” - the nomad’s view of the settlers. Nomadism is apparent in other writing by Chuck Moore - his disdain for “complexity” (which implies dependency on large teams of people you ought to manage, annoying constraints imposed by systems made by someone else and other things nomads don’t like), his firm opinion that distinct projects should have distinct code bases (customizability and “reuse” imply complexity and otherwise reduce the chances to “hermetically close” and truly complete a project), etc.

Humans and compilers need each other: the VLIW SIMD case

The state of the art in optimizing compilers today is such that for optimizing code, you need (1) a strong optimizing compiler and (2) a strong optimizing human. My rule of thumb is that (1) alone will yield 2x to 10x slower code. This is also what a person selling a (great) compiler “giving 80% of the optimal performance with no manual intervention” once told off-record to a roomful of programmers who pressed him into a corner, elevating my rule of thumb to a nobler plane of anecdotal evidence.

Now, I claim that this situation will persist, and in this post I’ll try to close the fairly large gap between this claim and the mere acknowledgment of what the state of the art is today. The gap is particularly large for the believer in the possibility of strong AI - and while my position is a bit different, I do believe in fairly strong AI (can I say that? people keep telling that I can’t say “nearly context-free”. oh well.)

I realize that many people experienced in optimization feel that, on the contrary, there’s in fact no gap large enough to justify an attempt as boringly rigorous (for a pop tech blog) at proving what they think is obvious as will shortly follow. But I think that many language geek discussions could benefit from a stronger bound on the power of a Sufficiently Smart Compiler than can be derived from (necessarily vague) doubts on the power of AI, and in this post I’ll try to supply such a bound. I actually think a lot of (mainly domain-specific) things could be achieved by AI-ish work on compilation - closer to “identify bubble-sort and convert to quick-sort” than to traditional “analyze when variables are alive and assign them to registers” - and this is why it’s useful to have a feeling when not to go there.

So, consider chess, where the state of the art is apparently quite similar to that in optimization: a strong human player using a strong computer program will take out both a human and a computer playing alone. However, it is conceivable that a program can be developed that doesn’t need the help of a human, being able of completely simulating human thought processes or instead alternative processes which are consistently superior. Why can’t it be the same with optimizing compilers?

(Chess and optimization are similar in another respect - few care about them; I readily acknowledge the insignificance of a 10x speed-up in a continuously expanding set of circumstances, I just happen to work in an area where it does count as a checkmate.)

I’ll try to show that optimization is a fundamentally different game from chess, quite aside from the formal differences such as decidability. I’ll use optimizing for VLIW SIMD processors to show where compilers outperform humans and vice versa. I’ll be quoting a book by the inventor of VLIW called “Embedded Computing: A VLIW Approach” to support my position on the relative strength of humans and compilers in these cases. I’ll then try to show that my examples are significant outside the peculiarities of current hardware, and attempt to state the general reason why humans are indispensable in optimization.

VLIW SIMD

First, we’ll do the acronym expansion; skip it if you’ve been through it.

VLIW stands for “Very Long Instruction Word”. What it really means is that your target processor can be told to execute several instructions in parallel. For example: R0=Add R1,R2 and R3=Mul R0,R1 and R1=Shift R5,R6. For this to work, the processor ought to be able to add, multiply and shift in parallel, that is, its execution hardware must be packed into several units, each getting distinct inputs. The units can be completely symmetric (all supporting the same operations); more often, different units support different instruction sets (so, for example, only one unit in a processor can multiply, but two of them can add, etc.) A stinky thing to note about VLIW instructions is the register semantics. In the example instruction above, R0 is mentioned both as an input and as an output. When it’s mentioned as an input of Mul its old value is meant, and not the value computed by Add. This is somewhat natural since the whole point is to run Add and Mul in parallel so you don’t want Mul to wait for Add; but it’s confusing nonetheless. We’ll come back to this shortly.

SIMD stands for “Single Instruction, Multiple Data” and is known much more widely than VLIW, being available at desktop and server processor architectures like x86 and PowerPC (VLIW reigns the quieter embedded DSP domain, the most commercially significant design probably being TI’s C6000 family.) SIMD means that you have commands like R0=Add8 R1,R2, which does 8 additions given 2 inputs. The registers are thus treated as vectors of numbers - for example, uint8[16], or uint16[8], or uint32[4], assuming 16b registers. This establishes a preference for lower-precision numbers since you can pack more of them into a register and thus process more of them at a time: with uint16, you use Add8, but with uint8, you get to use the 2x faster Add16. We’ll come back to this, too.

Optimizing for VLIW targets

The basic thing at which VLIW shines is the efficient implementation of “flat” loops (where most programs spend most time); by “flat”, I mean that there are no nested if/elses or loops. The technique for implementing loops on VLIW machines is called modulo scheduling. The same technique is used on superscalar machines like modern x86 implementations (the difference from VLIWs being the instruction encoding semantics).

Since I couldn’t find a good introductory page to link to, we’ll run through a basic example of modulo scheduling right here. The idea is pretty simple, although when I first saw hardware designers doing it manually in a casual manner, I was deeply shocked (they do it for designing new hardware rather than programming existing hardware but it’s the same principle).

Suppose you want to compute a[i]=b[i]*c+d on a VLIW processor with 4 units, 2 of them capable of load/store operations, 1 with an adder and 1 with a multiplier. All units have single-cycle latency (that is, their output is available to the next instruction; real VLIW units can have larger latencies, so that several instructions will execute before the result reaches the output register.) Let’s assume that Load and Store increment the pointer, and ignore the need to test for the exit condition through the loop. Then a trivial assembly implementation of a[i]=b[i]*c+d looks like this:

LOOP:
R0=Load b++
R1=Mul R0,c
R2=Add R1,d
Store a++,R2

This takes 4 cycles per iteration, and utilizes none of the processor’s parallelism as each instruction only uses 1 of the 4 execution units. Presumably we could do better; in fact the upper bound on our performance is 1 cycle per iteration, since no unit has to be used more than once to implement a[i]=b[i]*c+d (if we had two multiplications, for example, then with only 1 multiplying unit the upper bound would be 2 cycles/iteration.)

What we’ll do now is blithely schedule all of the work to a single instruction, reaching the throughput suggested by our upper bound:

LOOP:
R0=LOAD b++ and R1=MUL R0,c and R2=ADD R1,d and STORE a++,R2

Let’s look at what this code is doing at iteration N:

  • b[N] is loaded
  • b[N-1] (loaded at the previous iteration into R0) is multiplied by c
  • b[N-2]*c (computed at the previous iteration from the old value of R0 and saved to R1) is added to d
  • b[N-3]*c+d is saved to a[N]

This shows why our naive implementation doesn’t work (it would be quite surprising if it did) - at iteration 0, b[N-1] to b[N-3] are undefined, so it makes no sense to do things depending on these values. However, starting at N=3, our (single-instruction) loop body seems to be doing its job just fine (except for storing the result to the wrong place - b ran away during the first 3 iterations). We’ll take care of the first iterations by adding a loop header - instructions which implement the first 3 iterations, only doing the stuff that makes sense in those iterations:

R0=Load b++
R0=Load b++ and R1=Mul R0,c
R0=Load b++ and R1=Mul R0,c and R2=Add R1,d
LOOP:
R0=Load b++ and R1=Mul R0,c and R2=Add R1,d and Store a++,R2

For similar reasons, we need a loop trailer - unless we don’t mind loading 3 elements past the end of a[], but I reckon you get the idea. So we’ll skip the trailer part, and move to the more interesting case - what happens when the loop body won’t fit into a single instruction. To show that, I can add more work to be done in the loop so it won’t fit into the units, or I can use a weaker imaginary target machine to do the same work which will no longer fit into the (fewer) units. The former requires more imaginary assembly code, so I chose the latter. Let’s imagine a target machine with just 2 units, 1 with Load/Store and one with Add/Mul. Then our upper bound on performance is 2 cycles per iteration. The loop body will look like this:

LOOP:
R0=Load b++ and R2=Add R1,d
R1=Mul R0,c and Store a++,R2

Compared to the single-instruction case, which was still readable (”Load and Mul and Add and Store”), this piece looks garbled. However, we can still trace its execution and find that it works correctly at iteration N (assuming we added proper header code):

  • At instruction 1 of iteration N, b[N] is loaded
  • At instruction 2 of iteration N, b[N] (loaded to R0 by instr 1 of iter N) is multiplied by c
  • At instruction 1 of iteration N, b[N-1]*c (computed in R1 by instr 2 of iter N-1) is added to d
  • At instruction 2 of iteration N, b[N-1]*c+d (computed in R2 by instr 1 of iter N) is stored to a[N]

In common VLIW terminology, the number of instructions in the loop body, known to the rest of humanity as “throughput”, is called “initiation interval”. “Modulo scheduling” is presumably so named because the instructions implementing a loop body are scheduled “modulo initiation interval”. In our second example, the operations in the sequence Load, Mul, Add, Store go to instructions 0,1,0,1 = 0%2,1%2,2%2,3%2. In our first example, everything goes to i%1=0 - which is why I needed an example with at least 2 instructions in a loop, “modulo 1″ being a poor way to illustrate “modulo”.

In practice, “modulo scheduling” grows more hairy than simply computing the initiation interval, creating a linear schedule for your program and then “wrapping it around” the initiation interval using %. For example, if for whatever reason we couldn’t issue Mul and Store at the same cycle, we could still implement the loop at the 2 cycles/iteration throughput, but we’d have to move the Mul forward in our schedule, and adjust the rest accordingly.

I’ve done this kind of thing manually for some time, and let me assure you that fun it was not. An initiation interval of 3 with 10-15 temporary variables was on the border of my mental capacity. Compilers, on the other hand, are good at this, because you can treat your input program as a uniform graph of operations and their dependencies, and a legal schedule preserving its semantics is relatively easy to define. You have a few annoyances like pointer aliasing which precludes reordering, but it’s a reasonably small and closed set of annoyances. Quoting “Embedded Computing: A VLIW Approach” (3.2.1, p. 92): “All of these problems have been solved, although some have more satisfyingly closed-form solution than others.” Which is why some people with years of experience on VLIW targets know almost nothing about modulo scheduling - a compiler does a fine job without their help.

The book goes on to say that “Using a VLIW approach without a good compiler is not recommended” - in other words, a human without a compiler will not perform very well. Based on my experience of hand-coding assembly for a VLIW, I second that. I did reach about 95% of the performance of a compiler that was developed later, but the time it took meant that many optimizations just wouldn’t fit into a practical release schedule.

Optimizing for SIMD targets

I will try to show that humans optimize well for SIMD targets and compilers don’t. I’ll quote “Embedded Computing: A VLIW Approach” more extensively in this section. A book on VLIW may not sound like the best source for insight on SIMD, however, I somewhat naturally haven’t heard of a book on SIMD stressing how compilers aren’t good at optimizing for it. But then I haven’t heard of a book stressing the opposite, either, and success papers I saw claimed at automatic vectorization was modest. Furthermore, the particular VLIW book I quote is in fact focusing on embedded DSP where SIMD is ubiquitous, and its central theme is the importance of designing processors in ways making them good targets for optimizing compilers. It sounds like a good place to look for tips on designing compilers to work well with SIMD and vice versa; and if they say they have no such tips, it’s telling.

And in fact the bottom line of the discussion on SIMD (which they call “micro-SIMD”) is fairly grim: “The ability of compilers to automatically extract micro-SIMD without hints (and in particular, without pointer alignment information) is still unproven, and manual code restructuring is still necessary to exploit micro-SIMD parallelism” (4.1.4, p. 143). This statement from 2005 is consistent with what (AFAIK) compilers can do today. No SIMD-targeted programming environment I know relieves you of the need to use intrinsics in your C code as in “a = Add8(b,c)”, where Add8 is a built-in function-looking operator translated to a SIMD instruction.

What I find fascinating though is the way they singled out pointer alignment as a particularly interesting factor necessitating “hints”. Sure, most newbies to SIMD are appalled when they find out about the need to align pointers to 16 bytes if you want to use instructions accessing 16 bytes at a time. But how much of a show-stopper can that be if we are to look at the costs and benefits more closely? Aligning pointers is easy, producing run time errors when they aren’t is easier, telling a compiler that they are can’t be hard (say, gcc has a __vector type modifier telling that), and alternatively generating two pieces of code - optimized for the aligned case and non-optimized for the misaligned case - isn’t hard, either (the book itself mentions still other option - generating non-optimized loop header and trailer for the misaligned sections of an array).

There ought to be more significant reasons for people to be uglifying their code with non-portable intrinsics, and in fact there are. The book even discusses them in the pages preceeding the conclusion - but why doesn’t it mention the more serious reasons in the conclusion? To me this is revealing of the difference between a programmer’s perspective and a compiler writer’s perspective, which is related to the difference between optimization and chess: in chess, there are rules.

For an optimizing programmer, SIMD instructions are a resource from which most benefit must be squeezed at any reasonable cost, including tweaking the behavior of the program. For an optimizing compiler, SIMD instructions are something that can be used to implement a piece of source code, in fact the preferable way to implement it - as long as its semantics are preserved. This means that a compiler obeys rules a programmer doesn’t, making winning impossible. A typical reaction of a compiler writer is to think of this as not his problem - his problem ending where program transformations preserving the semantics are exhausted. I think this is what explains the focus on things like pointer alignment (which a compiler can in fact solve with a few hints and without affecting the results of the program) at the expense of the substantive issues (which it can’t).

In the context of SIMD optimizations, the most significant example of rules obeyed by just one of the contestants has to do with precision, which the book mentions right after alignment in its detailed discussion of the problems with SIMD. “Even when we manipulate byte-sized quantities (as in the case of most pixel-based images, for example), the precision requirements of the majority of manipulation algorithms require keeping a few extra bits around (9, 12, and 16 are common choices) for the intermediate stages of an algorithm. …this forces us up to the next practical size of sub-word … reducing the potential parallelism by a factor of two up front.” They go on to say that a 32b register will end up keeping just 2 16b numbers, giving a 2x speed-up - modest considering all the cases when you won’t get even that due to other obstacles.

This argument shows the problems precision creates for the hardware implementation of SIMD. However, the precision of intermediate results isn’t as hard a problem as this presentation makes it sound, because intermediate results are typically kept in registers, not in memory. So to keep the extra bits in intermediate results, you can either use large registers for SIMD operations and not “general-purpose” 32b ones, or you can keep intermediate results in pairs of registers - as long as you have enough processing units to generate and further process these intermediate results. Both things are done by actual SIMD hardware.

However, the significant problems created by precision lie at the software side: the compiler doesn’t know how many bits it will need for intermediate results, nor when precision can be traded for performance. In C, the type of the intermediate results in the expression (a[i]*3+b[i]*c[i])>>d is int (roughly, 32b), even if a, b and c are arrays of 8b numbers, and the parenthesized expression can in fact exceed 16b. The programmer may know that b[i]*c[i] never exceeds, say, 20000 so the whole thing will fit in 16b. That C has no way of specifying precise ranges of values a variable can hold (as opposed to Lisp, of all rivals to the title of the most aggressively optimizing environment) doesn’t by itself make an argument since a way could be added, just like gcc added __vector, not to mention the option of using a different language. Specifying the ranges of b[i] and c[i] wouldn’t always suffice and we would have to further uglify the code to specify the range of the product (in case both b[i] and c[i] can be large by themselves but never together), but it could be done.

The real problem with having to specify such information to the compiler isn’t the lack of a standard way of spelling it, but that a programmer doesn’t know when to do it. If it’s me who is responsible for the low-level aspects of optimization, I’ll notice the trouble with an intermediate result requiring too many bits to represent. I will then choose whether to handle it by investigating the ranges of b[i] and c[i] and restricting them if needed, by moving the shift by d into the expression as in (a[i]*3>>d)+(b[i]*c[i]>>d) so intermediate results never exceed 16b, or in some other way. But if it’s the compiler who’s responsible, chances are that I won’t know that this problem exists at all.

There’s a trade-off between performance gains, precision losses and the effort needed to obtain more knowledge about the problem. A person can make these trade-offs because the person knows “what the program really does”, and the semantics of the source code are just a rendering of that informal spec from one possible perspective. It’s even worse than that - a person actually doesn’t know what the program really does until an attempt to optimize it, so even strong AI capable of understanding an informal spec in English wouldn’t be a substitute for a person.

A person can say, “Oh, we run out of bits here. OK, so let’s drop the precision of the coefficients.” Theoretically, and importantly for my claim, strong AI can also say that - but only if it operates as a person and not as a machine. I don’t claim that we’ll never reach a point where we have a machine powerful enough to join our team as a programmer, just that (1) we probably wouldn’t want to and (2) if we would, it wouldn’t be called a compiler, it would be called a software developer. That is, you wouldn’t press a button and expect to get object code from your source code, you’d expect a conversation: “Hey, why do you need so many bits here - it’s just a smoothing filter, do you really think anyone will notice the difference? Do you realize that this generates 4x slower code?” And then someone, perhaps another machine, would answer that yes, perhaps we should drop some of the bits, but let’s not overdo it because there are artifacts, and I know you couldn’t care less because your job ends here but those artifacts are amplified when we compute the gradient, etc.

This is how persons optimize, and while a machine could in theory act as a person, it would thereby no longer be a compiler. BTW, we have a compiler at work that actually does converse with you - it says that it will only optimize a piece of code if you specify that the minimal number of iterations executed is such and such; I think it was me who proposed to handle that case using conversation. So this discussion isn’t pure rhetoric. I really wish compilers had a -warn-about-missed-optimization-opportunities switch that would give advice of this kind; it would help in a bunch of interesting cases. I just think that in some cases, precision being one of them, the amount and complexity of interactions needed to make headway like that exceeds the threshold separating aggressive optimization from aggressive lunacy.

To be sure, there are optimization problems that could be addressed by strong AI. In the case of SIMD, the book mentions one such area - they call it “Pack, Unpack, and Mix”. “Some programs require rearranging the sub-words within a container to deal with the different sub-word layouts. From time to time, the ordering of the sub-words within a word (for example, coming from loading a word from memory) does not line up with the parallelism in the code… The only solution is to rearrange the sub-words within the containers through a set of permutation or copying operations (for example, the MIX operation in the HP PA-RISC MAX-2 extension).”

An example of this reordering problem is warping: computing a[i]=b[i*step+shift]. This is impossible to do in SIMD without a permutation instruction of the kind they mention (PowerPC’s AltiVec has vec_perm, and AFAIK x86’s SSE has nothing so you can’t warp very efficiently). However, even if an instruction is available, compilers are AFAIK unable to exploit it. I see no reason why sufficiently strong AI couldn’t manage to do such things with few hints in some interesting cases. I wouldn’t bet my money on it - I side with Mitch Kapor on the Turing Test bet, but it is conceivable like the invincible chess playing program, and unlike transformations requiring “small” changes of the semantics.

Significance

There are areas of optimization that are very significant commercially but hardly interesting in a theoretical discussion (and this here’s a distinctively theoretical discussion as is any discussion where the possibility of strong AI is supposed to be taken into account).

For example, register allocation for the x86 is exceedingly gnarly and perhaps an interesting argument could be made to defend the need for human intervention in this process in extreme cases (I wouldn’t know since I never seriously optimized for the x86). However, a general claim that register allocation makes compiler optimization hard wouldn’t follow from such an argument: on a machine with plentiful and reasonably uniform registers, it’s hard to imagine what a human can do that a compiler can’t do better, and almost everybody would agree that the single reason for not making hardware that way is a commercial one - to make an x86-compatible processor.

Now, I believe that both SIMD and VLIW instruction encodings don’t have this accidental nature, and more likely are part of the Right Way of designing high-performance processors (assuming that it makes no sense to move cost from software to hardware and call that a “performance gain”, that is, assuming that performance is measured per square millimeter of silicon). One argument of rigor worthy of a pop tech blog is that most high-end processors have converged to SIMD VLIW: they have instructions processing short vectors and they can issue multiple instructions in parallel; some do the latter in the “superscalar” way of having the hardware analyze dependencies between instructions at run time and others do it in the “actual VLIW” way of having the lack of dependencies proven and marked by the compiler, but you end up doing modulo scheduling anyway.

However, this can of course indicate uninformed consumer preference rather than actual utility (I type this on a noisy Core 2 Duo box running Firefox on top of XP, a job better handled by a cheaper, silent single-core - and I’m definitely a consumer who should have known better). So my main reasons for believing VLIW and SIMD are “right” are abstract considerations on building von Neumann machines:

  • You typically have lots of distinct execution hardware: a multiplier has little in common with a load/store unit. Up to a point, it will therefore make sense to support parallel execution of instructions on the different execution hardware. The cost of supporting it will be more I/O ports connecting the execution units with the register file - quite serious because of the multiplexers selecting the registers to read/write. However, the cost of not supporting it will be more execution hardware left unused for more time. So the optimum is unlikely to be “no parallel execution”, it’s likely “judicious parallel execution”.
  • It is cheaper to have few wide registers and wide buses between the register file and the execution units than it is to have many narrow registers and buses. That’s because the cost of the register file is proportional to the product of #registers and #buses to the execution units. It is thus significantly cheaper to have 1 unit with 4 8bx8b multipliers and 2 32b buses for the inputs then it is to have 4 units with 1 8bx8b multiplier in each and 8 8b buses for the inputs. It’s also cheaper to keep 4 bytes in 1 32b register than in 4 8b registers. Likewise, it is cheaper to have 4 multipliers in 1 processor than to have 4 full-blown processor cores, because each core would have, say, its own fetch and decode logic and instruction cache - which are in fact pure overhead. So if you have a von Neumann machine with registers and buses and instruction cache, it makes sense (up to a point) to add SIMD to make the best of that investment, and this is why commercial VLIWs have SIMD, although the VLIW theory recommends more units instead.

Since I believe that both VLIW and SIMD are essential for maximizing hardware performance, I also tend to think that optimizations needed to utilize these features are “mainstream” enough to support a broad claim about optimization in general. And the main point of my claim is that compilers can’t win in the optimization game, because part of that game is the ability to change the rules once you start losing.

Humans faced with a program they fail to optimize change the program, sometimes a little, sometimes a lot - I heard of 5×5 filters made 4×4 to run on a DSP. But even if we exclude the truly shameless cheating of the latter kind, the gentler cheating going into every serious optimization effort still requires to negotiate and to take responsibility in a way that a person - human or artificial - can, but a tool like a compiler can not.

Modulo scheduling is an example of the kinds of optimizations which in fact are best left to a compiler - the ones where rules are fixed: once the code is spelled, few can be added by further annotations by the author and hence the game can be won without much negotiations with the author; although sometimes a little interrogation can’t hurt.

Pearls of wisdom

Proper Fixation always had more unfinished drafts than posts, but recently it’s getting ridiculous. I do have a couple of drafts I seriously intend to finish (usually the drafts which don’t make it to posthood during the first 4 hours or so go to the eternal drafthood land.) Until I’m able to think this stuff out to the point where I can share the results of my thinking, I figured I could share the far less scarce resource of Wisdom with ya.

***

Since I’ve violated the Golden Rule of Helping Friends with their PC Problems and attempted to help a friend with his PC problem, expectedly wiping out his hard drive in vain, I had many opportunities to explain the Programmer Paradox: how can a programmer fail to make a computer do as he wishes? While the difficulty of debugging a program without the source proved hard to explain to laymen, I think I’ve found a metaphor that does a good job. A programmer is to the blue screen of death what Mikhail Kalashnikov is to a loaded AK-47: just as helpless a victim as any other mortal, except for having a profound understanding of the mechanisms of his execution.

***

I would like to get some statistics on file encryption. For example, of all the files on the planet, X% are encrypted. Of all those files, Y% will never be read by someone due to encryption. Of all those files, Z% will never be read by malicious intruders. If I could lay my hands on the value of just one of these unknowns, I’d pick Z, because at least 100-Z% of the files will never be read by their owners. I would bet on Z lying somewhere between 0 and 1.

***

One of the key traits of good code is the ease at which it can be modified. One of the key traits of bad code is the high cost of modifying it. So good code is likely to deteriorate until it’s bad enough to become hard to change, and bad code is likely to stay bad. In short, code has a strong tendency to end up bad.

This can sound worthlessly pessimistic, similarly, for example, to “It is easier to break a leg than it is to cure it, therefore, most legs end up broken.” However, I think it’s more analogous to aging - the accumulation of changes in an organism, observably causing most animals to end up dead. Similarly, code that is used will be changed, code that is changed will degrade, and code that degrades beyond a certain point will die.

***

Health tends to be simpler than disease. For example, everybody can brush their teeth but few people can treat cavities. Similarly, it’s not very hard to maintain a sane development environment, but pretty hard to deal with the tide of bugs and of long-living branches resulting from a failure to do so. However, I’m generally optimistic about the chances of such cavities to be treated, and as usual, the optimism is based on the pain they cause - a strong incentive to seek and reward treatment.

***

There’s this evolution vs Intelligent Design debate. Well, I don’t know about life on Earth, but I sure have hard time believing in Intelligent Design in software. Code has to repeatedly survive exposure to users upon whom its fate depends. Yes, “users” can be a set containing just the author, but only if it’s honest-to-God USAGE, that is, the author has to pay a price when the program is hard to use - like not getting important things done properly. Show me a program that someone finds useful and that wasn’t subject to such evolutionary pressure, but rather was Intelligently Designed as useful.

I think that my intense hatred of the word “design” has to do with its prominent place in the speech of software creationists. These people are likely to constantly complain about not having enough resources to do The Right Thing in the ugly real world. They are also likely to give you software that you hate enough to wish to kill them, and be articulate enough to convince you that the problem is at your end, and fail to notice how this latter ability quadruples your desire to slash their body into square millimeter pieces.

***

I’ll conclude with an off-topic request: if you know a good text advocating a collectivistic or other kind of heterodox approach to economics, I’d be very grateful for a reference. By “advocacy”, I mean a text for laymen expressing support for a certain set of policies (as opposed to merely criticizing the effects of existing policies) - like Milton Friedman’s “Capitalism and Freedom”, for example.

The C++ Sucks Series: the quest for the entry point

Suppose you run on the x86 and you don’t like its default FPU settings. For example, you want your programs to dump core when they divide by zero or compute a NaN, having noticed that on average, these events aren’t artifacts of clever numerical algorithm design, but rather indications that somebody has been using uninitialized memory. It’s not necessarily a good idea for production code, but for debugging, you can tweak the x86 FPU thusly:

//this is a Linux header using GNU inline asm
#include <fpu_control.h>
void fpu_setup() {
unsigned short cw;
_FPU_GETCW(cw);
cw &= ~_FPU_MASK_ZM;//Divide by zero
cw &= ~_FPU_MASK_IM;//Invalid operation
_FPU_SETCW(cw);
}

So you call this function somewhere during your program’s initialization sequence, and sure enough, computations producing NaN after the call to fpu_setup result in core dumps. Then one day someone computes a NaN before the call to fpu_setup, and you get a core dump the first time you try to use the FPU after that point. Because that’s how x86 maintains its “illegal operation” flags and that’s how it uses them to signal exceptions.

The call stack you got is pretty worthless as you’re after the context that computed the NaN, not the context that got the exception because it happened to be the first one to use the FPU after the call to fpu_setup. So you move the call to fpu_setup to the beginning of main(), but help it does not. That’s because the offending computation happens before main, somewhere in the global object construction sequence. The order of execution of the global object constructors is undefined by the C++ standard. So if you kindly excuse my phrasing - where should we shove the call to fpu_setup?

If you have enough confidence in your understanding of the things going on (as opposed to entering hair-pulling mode), what you start looking for is the REAL entry point. C++ is free to suck and execute parts of your program in “undefined” (random) order, but a computer still executes instructions in a defined order, and whatever that order is, some instructions ought to come first. Since main() isn’t the real entry point in the sense that stuff happens before main, there ought to be another function which does come first.

One thing that could work is to add a global object to each C++ translation unit, and have its constructor call fpu_setup(); one of those calls ought to come before the offending global constructor - assuming that global objects defined in the same translation unit will be constructed one after another (AFAIK in practice they will, although in theory the implementation could, for example, order the constructor calls by the object name, so they wouldn’t). However, this can get gnarly for systems with non-trivial build process and/or decomposition into shared libraries. Another problem is that compilers will “optimize away” (throw away together with the side effects, actually) calls to constructors of global objects which aren’t “used” (mentioned by name). You can work around that by generating code “using” all the dummy objects from all the translation units and calling that “using” code from, say, main. Good luck with that.

The way I find much easier is to not try to solve this “portably” by working against the semantics prescribed by the C++ standard, but instead rely on the actual implementation, which usually has a defined entry point, and a bunch of functions known to be called by the entry point before main. For example, the GNU libc uses a function called __libc_start_main, which is eventually called by the code at _start (the “true” entry point containing the first executed instruction, AFAIK; I suck at GNU/Linux and only know what was enough to get by until now.) In general, running `objdump -T <program> | grep start` (which looks for symbols from shared libraries - “nm <program>” will miss those) is likely to turn up some interesting function. In these situations, some people prefer to find out from the documentation, others prefer to crawl under a table and die of depression; the grepping individuals of my sort are somewhere in between.

Now, instead of building (correctly configure-ing and make-ing) our own version of libc with __libc_start_main calling the dreaded fpu_setup, we can use $LD_PRELOAD - an env var telling the loader to load our library first. If we trick the loader into loading a shared library containing the symbol __libc_start_main, it will override libc’s function with the same name. (I’m not very good at dynamic loading, but the sad fact is that it’s totally broken, under both Windows and Unix, in the simple sense that where a static linker would give you a function redefinition error, the dynamic loader will pick a random function of the two sharing a name, or it will call one of them from some contexts and the other one from other contexts, etc. But if you ever played with dynamic loading, you already know that, so enough with that.)

Here’s a __libc_start_main function calling fpu_setup and then the actual libc’s __libc_start_main:

#include <dlfcn.h>

typedef int (*fcn)(int *(main) (int, char * *, char * *), int argc, char * * ubp_av, void (*init) (void), void (*fini) (void), void (*rtld_fini) (void), void (* stack_end));
int __libc_start_main(int *(main) (int, char * *, char * *), int argc, char * * ubp_av, void (*init) (void), void (*fini) (void), void (*rtld_fini) (void), void (* stack_end))
{
fpu_setup();
void* handle = dlopen("/lib/libc.so.6", RTLD_LAZY | RTLD_GLOBAL);
fcn start = (fcn)dlsym(handle, "__libc_start_main");
(*start)(main, argc, ubp_av, init, fini, rtld_fini, stack_end);
}

Pretty, isn’t it? Most of the characters are spent on spelling the arguments of this monstrosity - not really interesting since we simply propagate whatever args turned up by grepping/googling for “__libc_start_main” to the “real” libc’s __libc_start_main. dlopen and dlsym give us access to that real __libc_start_main, and /lib/libc.so.6 is where my Linux box keeps its libc (I found out using `ldd <program> | grep libc`).

If you save this to a fplib.c file, you can use it thusly:

gcc -o fplib.so -shared fplib.c
env LD_PRELOAD=./fplib.so <program>

And now your program should finally dump core at the point in the global construction sequence where NaN is computed.

This approach has the nice side-effect of enabling you to “instrument” unsuspecting programs without recompiling them s.t. they run with a reconfigured FPU (to have them crash if they compute NaNs, unless of course they explicitly configure the FPU themselves instead of relying on what they get from the system.) But there are niftier applications of dynamic preloading, such as valgrind on Linux and .NET on Windows (BTW, I don’t know how to trick Windows into preloading, just that you can.) What I wanted to illustrate wasn’t how great preloading is, but the extent to which C++, the language forcing you to sink that low just to execute something at the beginning of your program, SUCKS.

Barf.

Corrections - thanks to the respective commenters for these:

1. Section 3.6.2/1 of the ISO C++ standard states, that “dynamically initialized [objects] shall be initialized in the order in which their definition appears in the translation unit”. So at least you have that out of your way if you want to deal with the problem at the source code level.

2. Instead of hard-coding the path to libc.so, you can pass RTLD_NEXT to dlsym.

The internal free market

This is going to be a bit atypical, because I’m going to talk, like, about organizing large teams of programmers. Which I rarely do, for the simple reason that it’s not my problem. I’m not a manager, I don’t think I’m likely to do a particularly good job as a manager in the near future, and I don’t want to be a manager. As far as I’m concerned - if your problem is organizing lots of people, you brought it upon yourself. So this “internal free market” thing, which tends to work well according to my observations, is an exception to my general rule of not making or thinking much about “organizational” observations.

So, free markets. Basically a way to create incentives (because you have to compete) at the cost of redundancy (because of duplicated efforts by many competitors). A redundancy vs dependencies issue, if you like - several competitors means less dependence on each - and since I generally think redundancy is a fair price to pay for removing dependencies, you can guess that I’m leaning towards free market fundamentalism.

At this stage I’m skipping the detours where I’m dragging exciting pseudo-related stuff into this, like the subprime mortgage crisis and the enigmatically overwhelming support for Barack Obama by top programmers and tech bloggers. I’m skipping that to get to the simple point of there being no good way for an employer to create incentives for programmers with money.

How exactly this trick fails to work, and what kind of LOCs you get when you pay by the LOC is out of my scope. If you’re sincerely surprised, there’s lots of material for you to browse - Econ 101 Management being as good a place to start as any. The single thing I have to say about the “financial incentive” method here is that its failure isn’t at all surprising to the free market fundamentalist.

In a free market, people solve their own problems, and (pay to) use the output of other people only when it helps them solve their problems, in a completely distributed way. Setting prices for things and subsidizing them is the trademark of a centralized, controlled economy. Now, does an employer paying by the LOC or by 1/#defects or whatever create an “internal free market”, or an “internal government bureaucracy”? Without looking at soft stuff, like a hypothetical offense to sacred engineering values carried by the act of creating incentives, we can see that subsidizing LOCs will create surplus LOCs, just the way it works with agricultural surplus and everywhere else.

What would we do if we wanted a real internal free market? “Free market” means that we want to have people solving their problems, and let them “pay” each other in some way to solve them, without us controlling the latter process. In our specific context:

  1. “People” means “strong programmers” (or at least decent - or else why did “we”, the employer, hire them, dammit?) Folks who, at the very least, like to be productive and have their stuff used. Maybe they also “like to solve puzzles”, but not all do. For example, I hate puzzles, and have a strong preference for Alexandrian solutions. But you should still hire me.
  2. For those people, “solving their problems” means delivering user-visible features. This is the basic responsibility of developers towards the organization, this is what the organization is capable of judging (it better be), and this is what links the whole operation to reality through the external market forces.

The only question is, what does it mean for a developer to “pay” another developer? Paying with money makes no sense, not with our definition of “people”. People who’re into those transactions tend to be self-employed. However, developers do have their own currency, karma points or whatever term you prefer to use for it (they are all irksome; economics is to life what proctology is to anatomy - it’s ugly because it’s true). I know two kinds:

  1. Time. You can “pay” to developers, teams and the whole organization by volunteering to do particularly unsatisfying but important/urgent work.
  2. Code. When someone uses your code, they are paying you (I repeat, it’s not you who are giving something to them; do not make this error or they’ll stop using your code.)

Time and code are not unlike gold and printed money, because you can’t make more time but you can make more code. However, proceeding with this analogy and trying to scale it to include inflation and such will expose my economical illiteracy and general lunacy to an extent making me want to stop now.

What we’ll do now is examine how “trading” time and code works in programming, and how it creates incentives to invest efforts into the most needed things similarly to the way prices do in free market economies. We’ll start with “trading code” - the less intuitive but the more fundamental kind of transaction.

You have to deliver something user-visible. The user couldn’t care less about the guts making up your program - how you parse things, what your communication protocols are, which optimized libraries you use for processing bottlenecks, etc. However, you do care about these things, because they are really needed for things to work. With all these things, you could reuse “infrastructure” others work on (for example, me), or you could roll your own (let’s ignore “international trade” where you use a third-party library for the moment).

To you, depending on me is a risk - who knows how many bugs I have, how well my stuff maps to your needs, etc. To me, on the other hand, having you reuse my code is the best thing that can happen to me during work hours. After work, better things can happen, in particular, those having to do with spending the money earned during work hours. But at work, the best thing that I can do is be productive and have others use my code. Lots of users is the workplace fortune equivalent to being rich in the real world. Do you see who pays who here?

What can an organization do to manage these infrastructure transactions?

  1. The “economical”, “capitalist” solution: leave them alone except for securing them. “Leaving them alone” means not controlling them - not mandating the development and reuse of infrastructure and not assigning workforce to it. This means that by making my modules reusable, I’m only trying to please my internal users, so I’m likely to (try to) invest most effort into what they find important and helpful for doing their ultimate job. “Securing transactions” means something similar to the way public companies are forced to expose their accounting. If something becomes reusable code, it ought to have proper documentation, versioning, etc., and the organization must make sure it does.
  2. The “political”, “socialist” solution: assign the task of developing a parser, an optimized library, etc. to a person/team - subsidize the parser (the price to a user is now lower - even if the parser is all buggy, a person is officially assigned to fix the bugs, and the responsibility for failures moves to that person and not to the one who decided to reuse the code). This means that the parser will be created even if in a “free company” nobody would want to develop and maintain it, knowing that most people wouldn’t take the risk of using it for the benefits it provides. Leading to surplus crops of parsers.
  3. A further improvement on 2, the “communist” solution: force everyone to use The Parser. This means there are no “economical” means to punish the author whatsoever - where “punishing” means “not paying” and “not paying” means “stop using the code”. However, there’s still hope: you have political means to punish the author. For example, poke fun at the goddamn nightmare infrastructure, yell at the author, yell at his manager, ask your manager to yell at his manager’s manager - a whole slew of counter-infrastructure measures. Victims of infrastructural communism use them all the time.

So this is how “trading code” is (more accurately, “can be”) a better way of evolving reusable “infrastructure” than centralized planning. In general, the only thing I’m discussing is the reusable stuff - that’s what organizations can optimize (or pessimize, creating useless “reusable” modules and not creating the actually needed ones). Nothing can be done about things which aren’t reusable by definition, belonging to a single “feature”/”project” - those will have to be written once and only once no matter what.

What’s wrong with this picture? Could be many things, but one thing I’ll talk about (because I have a good answer for it) is the problem of “instant gratification”/”disruptive changes”/”local optimums”/etc. There are grand things that just can’t be done by small incremental changes, the by-products of work on “specific features”. You really need a person/team assigned to these things. This is somewhat similar to economies of scale which can be achieved by purchasing expensive machinery. How are many small farmers/shoe makers/etc. going to raise money for that machinery without central planning, if they’re all busy with their small short-term profits?

This is where entrepreneurs come into play. Entrepreneurs are people with fire up their asses. Normal people want enough money to get by, enough money to not worry about money, or enough money to not have to work for money. Entrepreneurs want more money than they can sensibly spend during the decades of their lifetime. And they want it because they desperately need that money to feed the fire raging up their asses. When they see a potential for making truckloads of money, many of them are willing to put their own savings on the line to chase that chance.

This psychological profile is a speculation of mine - my best attempt to comprehend the inexplicable behavior of making efforts and burning nerve endings to make more money than you could possibly need. However, I do have motivation which is quite similar in the context of our “economics of programming” analogy. I’m a “programming entrepreneur”, or at least I have, um, the same trademark proctological fireworks. I’m thrilled by opportunities to make stuff that, like, everybody will use, everything will depend on, …and everyone will want a piece of me when it breaks - so? It’s still worth it.

I can’t make such stuff as a by-product of working on something reasonably user-visible. I need to be assigned to it. What are the savings that I can put on the line? Time invested into doing unsatisfying, but important work. I call my own way of making these deals “buying development with debugging”. I’m usually willing to debug the weirder of the urgent problems, although it’s not much fun by itself, because it translates to a lot of karma points. I can then spend those karma points by working on what I want 80% of the time, 20% being the continuous urgent debugging tax.

Again, there’s more than one way for that kind of “entrepreneur” to start a programming venture:

  1. The “economical” way - spend my own time implementing my ideas. Like a “real” entrepreneur putting his savings on the line and forced to look at his company bleeding that money if it doesn’t take off, I will want to stop as soon as possible when I realize that I’m failing. Those so-called “organizational karma points” you gain in the trenches have better uses than wasting on the development of worthless stuff nobody will use.
  2. The “political” way - convincing “the government” (a manager) that my idea is worth implementing, and have someone assigned to it. Now nobody wants to admit the failure early on. I’m not losing anything when someone else struggles with the implementation - “I could do it better”. The person working on the thing isn’t really held responsible for the failure, either - not his idea, so why not keep trying to make it work? Everybody wants to make his stuff work and be used, after all. And the manager won’t want to admit the failure because of all people, he’ll get most of the blame. Therefore, the worthless effort will not be stopped for a lot of time.

Free market supporters are sometimes blamed for disrespecting people and reducing human nature to primitive egoism. Well, the only thing I can say is that I sure am a Good Person (how could it be different?), I respect myself lots, I successfully “launched” more than one “programming venture” both ways - “economical” (DIY) and “political” (persuasion), and of each of these two kinds, some succeeded and some failed.

And believe you me, deep down I refuse to take responsibility for the failing “politically launched” projects even now when we talk about it. On the other hand, the “economically launched” failures are - seriously - the best thing that happened to me in my professional life. I attribute most of my occasional successes - or, more accurately, non-failures - to lessons learned from the DIY failures, which I had no choice but admit responsibility for. (Damn, that was painful. To the extent that wasn’t on my job description.)

Now, I’m not an “internal free market fundamentalist”, simply because I know much more about programming than I do about economics, and obnoxious/oversimplified opinions usually correlate with ignorance. However, my experience seems to show that “internal free markets” are healthy enough to sustain continuous improvements on many scales, and eventually punish both “greedy” “instant gratification” techniques of pleasing managers/customers and architectural masturbation, promoting solid work.

And if you’re not a manager (I mostly care about non-managers, guys and gals like me, you know), I think this quasi-economical angle can contribute to your ability to look at some young initiative around you and say “Hm, this might work out” and conversely “Epic fail on the way, I’m not going to touch this with a laser pointer, man”. So, FYI.

Consistency: how to defeat the purpose of IEEE floating point

I don’t know much about the design of IEEE floating point, except for the fact that a lot of knowledge and what they call “intellectual effort” went into it. I don’t even know the requirements, and I suspect those were pretty detailed and complex (for example, the benefits of having a separate representation for +0 and -0 seem hard to grasp unless you know about the very specific and hairy examples in the complex plane). So I don’t trust my own summary of the requirements very much. That said, here’s the summary: the basic purpose of IEEE floating point is to give you results of the highest practically possible precision at each step of your computation.

I’m not going to claim this requirement is misguided, because I don’t feel like arguing with people two orders of magnitude more competent than myself who have likely faced much tougher numerical problems than I’ve ever seen. What I will claim is that differences in numerical needs divide programmers into roughly three camps, and the highest-possible-precision approach hurts one of them really badly, and so has to be worked around in ways we’ll discuss. The camps are:

  1. The huge camp of people who do businessy accounting. Those should work with integral types to get complete, deterministic, portable control over rounding and all that. Many of the clueless people in this camp represent 1 dollar and 10 cents as the floating point number 1.1. While they are likely a major driving force behind economical growth, I still think they deserve all the trouble they bring upon themselves.
  2. The tiny camp doing high-end scientific computing. Those are the people who can really appreciate the design of IEEE floating point and use its full power. It’s great that humanity accidentally satisfied the needs of this small but really cool group, making great floating point hardware available everywhere through blind market forces. It’s like having a built-in Stradivari in each home appliance. Yes, perhaps I exaggerate; I get that a lot.
  3. The sizable camp that deals with low-end to mid-range semi-scientific computing. You know, programs that have some geometry or physics or algebra in them. 99.99% of the code snippets in that realm work great with 64b floating point, without the author having invested any thought at all into “numerical analysis”. 99% of the code works with 32b floats. When someone stumbles upon a piece of code in the 1% and actually witnesses fatal precision loss, everybody gathers to have a look as if they heard about a beautiful rainbow or a smoke suggesting a forest fire near the horizon.

The majority of people who use and actually should be using floating point are thus in camp 3. Those people don’t care about precision anywhere near camp 2, nor do they know how to make the best of IEEE floating point in the very unlikely circumstances where their naive approach will actually fail to work. What they do care about though is consistency. It’s important that things compute the same on all platforms. Perhaps more importantly for most, they should compute the same under different build settings, most notably debug and release mode, because otherwise you can’t reproduce problems.

Side note: I don’t believe in build modes; I usually debug production code in release mode. It’s not just floating point that’s inconsistent across modes - it’s code snippets with behavior undefined by the language, buggy dependence on timing, optimizer bugs, conditional compilation, etc. Many other cans of worms. But the fact is that most people have trouble debugging optimized code, and nobody likes it, so it’s nice to have the option to debug in debug mode, and to do that, you need things to reproduce there.

Also, comparing results of different build modes is one way to find worms from those other cans, like undefined behavior and optimizer bugs. Also, many changes you make are optimizations or refaptorings and you can check their sanity by making sure they didn’t change the results of the previous version. As we’ll see, IEEE FP won’t give you even that, regardless of platforms and build modes. The bottom line is that if you’re in camp 3, you want consistency, while the “only” things you can expect from IEEE FP is precision and speed. Sure, “only” should be put in quotes because it’s a lot to get, it’s just a real pity that fulfilling the smaller and more popular wish for consistency is somewhere between hard and impossible.

Some numerical analysts seem annoyed by the camp 3 whiners. To them I say: look, if IEEE FP wasn’t the huge success that it is in the precision and speed departments, you wouldn’t be hearing from us because we’d be busy struggling with those problems. What we’re saying is the exact opposite of “IEEE FP sucks”. It’s “IEEE FP is so damn precise and fast that I’m happy with ALL of its many answers - the one in optimized x86 build, the one in debug PowerPC build, the one before I added a couple of local variables to that function and the one I got after that change. I just wish I consistently got ONE of these answers, any of them, but the same one.” I think it’s more flattering than insulting.

I’ve accumulated quite some experience in defeating the purpose of IEEE floating point and getting consistency at the (tiny, IMO) cost of precision and speed. I want to share this knowledge with humanity, with the hope of getting rewarded in the comments. The reward I’m after is a refutation of my current theory that you can only eliminate 95%-99% of the pain automatically and have to solve the rest manually each time it raises its ugly head.

The pain breakdown

I know three main sources of floating point inconsistency pain:

  1. Algebraic compiler optimizations
  2. “Complex” instructions like multiply-accumulate or sine
  3. x86-specific pain not available on any other platform; not that ~100% of non-embedded devices is a small market share for a pain.

The good news is that most pain comes from item 3 which can be more or less solved automatically. For the purpose of decision making (”should we invest energy into FP consistency or is it futile?”), I’d say that it’s not futile and if you can cite actual benefits you’d get from consistency, than it’s worth the (continuous) effort.

Disclaimer: I only discuss problems I know and to the extent of my understanding. I have no solid evidence that this understanding is complete. Perhaps the pain breakdown list should have item 4, and perhaps items 1 to 3 have more meat than I think. And while I tried to get the legal stuff right - which behavior conforms to IEEE 754, which conforms to C99, and which conforms to nothing but is still out there - I’m generally a weak tech lawyer and can be wrong. I can only give you the “worked on my 4 families of machines” kind of warranty.

Algebraic compiler optimizations

Compilers, or more specifically buggy optimization passes, assume that floating point numbers can be treated as a field - you know, associativity, distributivity, the works. This means that a+b+c can be computed in either the order implied by (a+b)+c or the one implied by a+(b+c). Adding actual parentheses in source code doesn’t help you one bit. The compiler assumes associativity and may implement the computation in the order implied by regrouping your parentheses. Since each floating point operation loses precision on some of the possible inputs, the code generated by different optimizers or under different optimization settings may produce different results.

This could be extremely intimidating because you can’t trust any FP expression with more than 2 inputs. However, I think that programming languages in general don’t allow optimizers to make these assumptions, and in particular, the C standard doesn’t (C99 §5.1.2.3 #13, didn’t read it in the document but saw it cited in two sources). So this sort of optimization is actually a bug that will likely be fixed if reported, and at any given time, the number of these bugs in a particular compiler should be small.

I only recall one recurring algebraic optimization problem. Specifically, a*(b+1) tended to be compiled to a*b+a in release mode. The reason is that floating-point literal values like 1 are expensive; 1 becomes a hairy hexadecimal value that has to be loaded from a constant address at run time. So the optimizer was probably happy to optimize a literal away. I was always able to solve this problem by changing the spelling in the source code to a*b+a - the optimizer simply had less work to do, while the debug build saw no reason to make me miserable by undoing my regrouping back into a*(b+1).

This implies a general way of solving this sort of problem: find what the optimizer does by looking at the generated assembly, and do it yourself in the source code. This almost certainly guarantees that debug and release will work the same. With different compilers and platforms, the guarantee is less certain. The second optimizer may think that the optimization you copied from the first optimizer into your source code is brain-dead, and undo it and do a different optimization. However, that means you target two radically different optimizers, both of which are buggy and can’t be fixed in the near future; how unlucky can you get?

The bottom line is that you rarely have to deal with this problem, and when it can’t be solved with a bug report, you can look at the assembly and do the optimization in the source code yourself. If that fails because you have to use two very different and buggy compilers, use the shotgun described in the next item.

“Complex” instructions

Your target hardware can have instructions computing “non-trivial” expressions beyond a*b or a+b, such as a+=b*c or sin(x). The precision of the intermediate result b*c in a+=b*c may be higher than the size of an FP register would allow, had that result been actually stored in a register. IEEE and the C standard think it’s great, because the single instruction generated from a+=b*c is both faster and more precise than the 2 instructions implementing it as d=b*c, a=a+d. Camp 3 people like myself don’t think it’s so great, because it happens in some build modes but not others, and across platforms the availability of these instruction varies, as does their precision.

AFAIK the “contraction” of a+=b*c is permitted by both the IEEE FP standard (which defines FP + and *) and the C standard (which defines FP types that can map to standards other than IEEE). On the other hand, sin(x), which also gets implemented in hardware these days, isn’t addressed by either standard - to the same effect of making the optimization perfectly legitimate. So you can’t solve this by reporting a bug the way you could with algebraic optimizations. The other way in which this is tougher is that tweaking the code according to the optimizer’s wishes doesn’t help much. AFAIK what can help is one of these two things:

  1. Ask the compiler to not generate these instructions. Sometimes there’s an exact compiler option for that, like gcc’s platform-specific -mno-fused-madd flag, or there’s (a defined and actually implemented) pragma directive such as #pragma STDC FP_CONTRACT. Sometimes you don’t have any of that, but you can lie to the compiler that you’re using an older (compatible) revision of the processor architecture without the “complex” instructions. The latter is an all-or-nothing thing enabling/disabling lots of stuff, so it can degrade performance in many different ways; you have to check the cost.
  2. If compiler flags can’t help, there’s the shotgun approach I promised to discuss above, also useful for hypothetical cases of hard-to-work-around algebraic optimizations. Instead of helping the optimizer, we get in its way and make optimization impossible using separate compilation. For example, we can convert a+=b*c to a+=multiply_dammit(b,c); multiply_dammit is defined in a separate file. This makes it impossible for most optimizers to see the expression a+=b*c, and forces them to implement multiplication and addition separately. Modern compilers support link-time inlining and then they do optimize the result as a whole. But you can disable that option, and as a side effect speed up linkage a great deal; if that seriously hurts performance, your program is insane and you’re a team of scary ravioli coders.

The trouble with the shotgun approach, aside from its ugliness, is that you can’t afford to shoot at the performance-critical parts of your code that way. Let us hope that you’ll never really have to choose between FP consistency and performance, as I’ve never had to date.

x86

Intel is the birthplace of IEEE floating point, and the manufacturer of the most camp-3-painful and otherwise convoluted FP hardware. The pain comes, somewhat understandably, from a unique commitment to the IEEE FP philosophy - intermediate results should be as precise as possible; more on that in a moment. The “convoluted” part is consistent with the general insanity of the x86 instruction set. Specifically, the “old” (a.k.a “x87″) floating point unit uses a stack architecture for addressing FP operands, which is pretty much the exact opposite of the compiler writer’s dream target, but so is the rest of x86. The “new” floating point instructions in SSE don’t have these problems, at the cost of creating the aesthetic/psychiatric problem of actually having two FP ISAs in the same processor.

Now, in our context we don’t care about the FP stack thingie and all that, the only thing that matters is the consistency of precision. The “old” FP unit handles precision thusly. Precision of stuff written to memory is according to the number of bits of the variable, ’cause what else can it be. Precision of intermediate results in the “registers” (or the “FP stack” or whatever you call it) is defined according to the FPU control & status register, globally for all intermediate results in your program.

By default, it’s 80 bits. This means that when you compute a*b+c*d and a,b,c,d are 32b floats, a*b and c*d are computed in 80b precision, and then their 80b sum is converted to a 32b result in memory (if a*b+c*d is indeed written to memory and isn’t itself an “intermediate” result). Indeed, what’s “intermediate” in the sense of not being written to memory and what isn’t? That depends on:

  1. Debug/release build. If we have “float e=a*b+c*d”, and e is only used once right in the next line, the optimizer probably won’t see a point in writing it to memory. However, in a debug build there’s a good reason to allocate it in memory, because if you single-step your program and you’re already past the line that used e, you still might want to look at the value of e, so it’s good that the compiler kept a copy of it for the debugger to find.
  2. The code “near” e=a*b+c*d according to the compiler’s notion of proximity also affects its decisions. There are only so many registers, and sometimes you run out of them and have to store things in memory. This means that if you add or remove code near the line or in inline functions called near the line, the allocation of intermediate results may change.

Compilers could have an option asking them to hide this mess and give us consistent results. The problems with this are that (1) if you care about cross-platform/compiler consistency, then the availability of cross-mode consistency options in one compiler doesn’t help with the other compiler and (2) for some reason, compilers apparently don’t offer this option in a very good way. For example, MS C++ used to have a /fltconsistency switch but seems to have abandoned it in favor of an insane special-casing of the syntax float(a*b)+float(c*d) - that spelling forces consistency (although the C++ standard doesn’t assign it a special meaning not included in the plain and sane a*b+c*d).

I’d guess they changed it because of the speed penalty it implies rather than the precision penalty as they say. I haven’t heard about someone caring both about consistency and that level of precision, but I did hear that gcc’s consistency-forcing -ffloat-store flag caused notable slowdowns. And the reason it did is implied by its name - AFAIK the only way to implement x86 FP consistency at compile time is to generate code storing FP values to memory to get rid of the extra precision bits. And -ffloat-store only affects named variables, not unnamed intermediate results (annoying, isn’t it?), so /fltconsistency, assuming it actually gave you consistency of all results, should have been much slower. Anyway, the bottom line seems to be that you can’t get much help from compilers here; deal with it yourself. Even Java gave up on its initial intent of getting consistent results on the x87 FPU and retreated to a cowardly strictfp scheme.

And the thing is, you never have to deal with it outside of x86 - all floating point units I’ve come across, including the ones specified by Intel’s SSE and SSE2, simply compute 32b results from 32b inputs. People who decided to do it that way and rob us of quite some bits of precision have my deepest gratitude, because there’s absolutely no good way to work around the generosity of the original x86 FPU designers and get consistent results. Here’s what you can do:

  1. Leave the FP CSR configured to 80b precision. 32b and 64b intermediate results aren’t really 32b and 64b. The fact that it’s the default means that if you care about FP result consistency, intensive hair pulling during your first debugging sessions is an almost inevitable rite of passage.
  2. Set the FP CSR to 64b precision. If you only use 64b variables, debug==release and you’re all set. If you have 32b floats though, then intermediate 32b results aren’t really 32b. And usually you do have 32b floats.
  3. Set the FP CSR to 32b precision. debug==release, but you’re far from “all set” because now your 64b results, intermediate or otherwise, are really 32b. Not only is this a stupid waste of bits, it is not unlikely to fail, too, because 32b isn’t sufficient even for some of the problems encountered by camp 3. And of course it’s not compatible with other platforms.
  4. Set the FP CSR to 64b precision during most of the program run, and temporarily set it to 32b in the parts of the program using 32b floats. I wouldn’t go down that path; option 4 isn’t really an option. I doubt that you use both 32b and 64b variables in a very thoughtful way and manage to have a clear boundary between them. If you depend on the ability of everyone to correctly maintain the CSR, then it sucks sucks sucks.

Side note: I sure as hell don’t believe in “very special” “testing” build/running modes. For example, you could say that you have a special mode where you use option (3) and get 32b results, and use that mode to test debug==release or something. I think it’s completely self-defeating, because the point of consistency is being able to reproduce a phenomenon X that happens in a mode which is actually important, in another mode where reproducing X is actually useful. Therefore, who needs consistency across inherently useless modes? We’d be defeating the purpose of defeating the purpose of IEEE floating point.

Therefore, if you don’t have SSE, the only option is (2) - set the FP CSR to 64b and try to avoid 32b floats. On Linux, you can do it with:

#include <fpu_control.h>
fpu_control_t cw;
_FPU_GETCW(cw);
cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
_FPU_SETCW(cw);

Do it first thing in main(). If you use C++, you should do it first thing before main(), because people can use FP in constructors of global variables. This can be achieved by figuring out the compiler-specific translation unit initialization order, compiling your own C/C++ start-up library, overriding the entry point of a compiled start-up library using stuff like LD_PRELOAD, overwriting it in a statically linked program right there in the binary image, having a coding convention forcing to call FloatingPointSingleton::instance() before using FP, or shooting the people who like to do things before main(). It’s a trade-off.

The situation is really even worse because the FPU CSR setting only affects mantissa precision but not the exponent range, so you never work with “real” 64b or 32b floats there. This matters in cases of huge numbers (overflow) and tiny numbers (double rounding of subnormals). But it’s bad enough already, and camp 3 people don’t really care about the extra horror; if you want those Halloween stories, you can find them here. The good news are that today, you are quite likely to have SSE2 and very likely to have SSE on your machine. So you can automatically sanitize all the mess as follows:

  1. If you have SSE2, use it and live happily ever after. SSE2 supports both 32b and 64b operations and the intermediate results are of the size of the operands. BTW, mixed expressions like a+b where a is float and b is double don’t create consistency problems on any platform because the C standard specifies the rules for promotion precisely and portably (a will be promoted to double). The gcc way of using SSE2 for FP is -mfpmath=sse -msse2.
  2. If you only have SSE, use it for 32b floats which it does support (gcc: -mfpmath=sse -msse). 64b floats will go to the old FP unit, so you’ll have to configure it to 64b intermediate results. Everything will work, the only annoying things being (1) the retained necessity to shoot the people having fun before main and (2) the slight differences in the semantics of control flags in the old FP and the SSE FP CSR, so if you configure your own policy, floats and doubles will not behave exactly the same. Neither problem is a very big deal.

Interestingly, SSE with its support for SIMD FP commands actually can make things worse in the standard-violating-algebraic-optimizations department. Specifically, Intel’s compiler reportedly has (had?) an optimization which unrolls FP accumulation loops and reorders additions in order to utilize SIMD FP commands (gcc 4 does that, too, but only if you explicitly ask for trouble with -funsafe-math-optimizations or similar). But I wouldn’t conclude anything from it, except that automatic vectorization, which is known to work only on the simplest of code snippets, actually doesn’t work even on them.

Summary: use SSE2 or SSE, and if you can’t, configure the FP CSR to use 64b intermediates and avoid 32b floats. Even the latter solution works passably in practice, as long as everybody is aware of it.

I think I covered everything I know except for things like long double, FP exceptions, etc. - and if you need that, you’re not in camp 3; go away and hang out with your ivory tower numerical analyst friends. If you know a way to automate away more pain, I’ll be grateful for every FP inconsistency debugging afternoon your advice will save me.

Happy Halloween!

Off topic

  1. To comment, you no longer need to register, just type “y” to confirm you’re a human. Thanks to Aristotle Pagaltzis for pointing out that the previous arrangement sucked.
  2. I’ve started another blog, mostly hosting images. For example:

I originally intended to have one blog for everything, but since you’ve probably subscribed for the technobabble, I’ll reserve the channel for that.

I want a struct linker

Here’s a problem I’ve seen a lot (it’s probably classified as an “Antipattern” or a “Code Smell” and as such has a name in the appropriate circles, but I wouldn’t know, so I’ll leave it nameless).

You have some kind of data structure that you pass around a lot. Soon, the most valuable thing about the structure isn’t the data it keeps, but the fact that it’s available all the way through some hairy flow of control. If you want to have your data flow through all those pipes, just add it to The Data Structure. (To antipattern classification enthusiasts: I don’t think we have a god object yet because we really want to pass our data through that flow and it’s right to have one kind of structure for that and not, say, propagating N+1 function parameters.)

Now suppose the structure holds an open set of data. For example, a spam filter could have a data structure to which various passes add various cues they extract from the message, and other passes can access those cues. We don’t want the structure to know what passes exist and what cues they extract, so that you can add a pass without changing the structure.

I don’t think there’s a good way to do it in a 3GL. In C or C++, you can:

  • Aggregate the cue structures by value (which means you have to recompile everything once you change/add/remove a member from any of them)
  • Keep pointers to the cue structures and use forward declarations to avoid recompilation (a bit slower, and you still have to recompile when you add/remove a whole cue structure)
  • Keep an array of void* or base class objects (not debugger-friendly, and requires a registration procedure to resize the arrays according to the number of passes and deal dynamically computed indexes to the cues to all who wish to access them)
  • Keep a key -> void* map (increasingly slow and debugger-unfriendly, and you need registration to compute the keys from cue names, or use the C substitute for interning - use pointers to global variables with names like &g_my_cue_key as keys)
  • Keep a string -> void* map (no registration or pseudo-interning, but really slow)

On top of JVM or .NET, you have pretty much the same options, plus the option to generate the cue container structure dynamically. Each cue would define an interface and the container structure would implement all those interfaces. The debugger would display containers nicely, and the code accessing them wouldn’t depend on the container class. I’d guess nobody does that though because the class generation part is likely somewhat gnarly.

In a 4GL, you can add attributes to class objects at run time. This is similar to keeping a key->pointer map in a 3GL, except the name interning is handled by the system as it should, and you don’t confuse debuggers because you’re using a standard feature of the object system. Which solves everything except for the speed issue, which is of course dwarfed by other 4GL speed issues.

Now, I used to think of it as one of the usual speed vs convenience trade-offs, but I no longer think it is, because a struct linker could solve it.

Suppose you could have “distributed” struct/class definitions in an offset-based language; you could write “dstruct SpamCues { ViagraCue viagra; CialisCue cialis; }” in the Medication spam filter module, and “dstruct SpamCues { FallicSymbolsCue fallic; SizeDescriptionsCue size; }” in the Penis Enlargement module. The structure is thus defined by all modules linked into the application.

When someone gets a SpamCues structure and accesses cues.viagra, the compiler generates a load-from-pointer-with-offset instruction - for example, in MIPS assembly it’s spelled “lw offset(ptrreg)”. However, the offset would be left for the linker to resolve, just the way it’s done today for pointers in “move reg, globalobjectlabel” and “jump globalfunclabel”.

This way, access to “distributed” structures would be as fast as “normal” structures. And you would preserve most optimizations related to adjacent offsets. For example, if your machine supports multiple loads, so a rectangle structure with 4 int members can be loaded to 4 registers with “ldm rectptrreg,{R0-R4}” or something, it could still be done because the compiler would know that the 4 members are adjacent; the only unknown thing would be the offset of the rectangle inside the larger struct.

One issue the linker could have on some architectures is handling very large offsets that don’t fit into the instruction encoding of load-from-pointer-with-offset forms. Well, I’d live even with the dumbest solution where you always waste an instruction to increment a pointer in case the offset is too large. And then you could probably do better than that, similarly to the way “far calls” (calls to functions at addresses too far from the point of call for the offset to fit into 28 bits or whatever the branch offset encoding size is on your machine) are handled today.

The whole thing can fail in presence of dynamic loading during program run as in dlopen/LoadLibrary; if you already have objects of the structure, and your language doesn’t support relocation because of using native pointers, then the dynamically loaded module won’t be able to add members to a dstruct since it can’t update the existing objects. Well, I can live with that limitation.

If the language generates native object files, there’s the problem of maintaining compatibility with the object file format. I think this could “almost” be done, by mapping a distributed structure to a custom section .dstruct.SpamCues, and implementing members (viagra, cialis, fallic, size) as global objects in that section. Then if an equivalent of a linker script says that the base address of .dstruct.SpamCues is 0, then &viagra will resolve to the offset of the member inside the structure. The change to automatically map sections named .dstruct.* to 0 surely isn’t more complicated than the handling of stuff like .gnu.linkonce, inflicted upon us by the idiocy of C++ templates and the likes of them.

And here’s why I’ll probably never see a struct linker:

  • If the language uses a native linker, a small change must be done to that linker in order to handle encodings of load/store instructions in ways it previously didn’t (currently it only has to deal with resolving pointers, not offsets). Since it’s platform-specific, the small change is actually quite large.
  • You could compromise and avoid that change by generating less efficient code which uses the already available linker ability to resolve the “address” of the viagra object in the zero-based .dstruct.SpamCues section - the code can add that “address” (offset, really) to &cues. But that could still force changes in the compiler back-end because now it has to generate assembly code adding what looks like 2 addresses, which makes no sense today and might be unsupported if the back-end preserves type information.
  • The previous items assume that the portable “front-end” work to support something like dstruct isn’t a big deal. However, I’d guess that not enough people would benefit from it/realize they’d benefit from it to make it appear in a mainstream language and its front-ends.
  • I could roll my own compiler to a language similar to a mainstream one, with a bunch of additions like this struct linker thingie. Two problems with this. One - it’s too hard to parse all the crud in a mainstream language (even if it isn’t C++) to make it worth the trouble, unless your compiler does something really grand; a bunch of nice features probably aren’t worth it. Two - most programmers take a losing approach towards their career where they want to put mainstream languages on their resume so that losers at the other end can scan their resumes for those languages; if your code is spelled in a dialect, you’ll scare off the losers forming the backbone of our industry.

It still amazes me how what really matters isn’t what can be done, but what’s already done. It’s easier to change goddamn hardware than it is to change “software infrastructure” like languages, software tools, APIs, protocols and all kinds of that shit. I mean, here we have something that’s possible and even easy to do, and yet completely impractical.

Guess I’ll have to roll my own yet-another-distributed-reflective-registration bullshit. Oh well.

The cardinal programming jokes

I’m depressed. What I’ll do is I’ll tell you the 3 cardinal programming jokes. And if it helps cheer me up, I’ll consider my job well done.

I must warn you about those jokes. Firstly, they are translated from Russian and Hebrew by yours truly, which may cause them to lose some of their charm. Secondly, I’m not sure they came with that much charm to begin with, because my taste in jokes (or otherwise) can be politely characterized as “lowbrow”. In particular, all 3 jokes are based on the sewer/plumber metaphor. I didn’t consciously collect them based on this criterion, it just turns out that I can’t think of a better metaphor for programming.

By the way, I was recently told by a very strong programmer that of all things, he wanted to become a plumber as a kid. ‘Cause it was very interesting to him, the tools, the pipes, how you make the whole thing work. And then he felt he understood enough of it, so he figured he’d become a programmer instead. And now he is, and he has enough (virtual) pipes full of (virtual) shit to keep him curious about how to make it work for the rest of his life. By which I mean to say, hey, it’s not just my bad taste, it is a good metaphor, see?

So, the jokes. Lowbrow, depressing stuff. You have been warned.

Expanding your skill set

A very important thing. You should be learning stuff. Yada yada.

With many things though, people have this strange tendency to avoid knowing them, and instead ask someone else unfortunate enough to already know them. Say, Makefiles. Is it just my experience or do people worldwide pretend to be incapable of dealing with a hairy Makefile, and leave its regularly scheduled tweaking to a small set of knowledgeable victims?

Or debugging of the lowest kind, with race conditions and creative memory corruption. People like to give up on that, as long as someone else can take over. “I just don’t know how to proceed”. Right.

Sometimes I wish I could put this claim to a test. Check if they’d say this at gunpoint. Or, more humanely and therefore much less cheaply, propose them $1M if they do know how to proceed. I bet they’d think a bit harder. If you’re working on AI, specifically on preparing it to the Turing test, don’t forget to teach it this principle, or else it has no chance of passing for a human.

I find that the following describes the double-edged sword that is skill set expansion quite well:

A plumber and his apprentice pay a visit to a manhole requiring their attention. The plumber goes down the manhole, and the apprentice stays above with the toolbox. The plumber asks for wrench #3, and the apprentice puts the wrench into his hand. 2 minutes pass. “Wrench #5!” The apprentice finds the wrench and passes it to the plumber. 5 more minutes. “Wrench #6!” The plumber is given that, takes a couple more minutes and finally comes out.

The next scene should really be a small piece of pantomime, but I’ll have to get by with words alone. Not unexpectedly for this type of joke, the plumber comes out with his arms covered with excrement. He slowly sweeps his right hand over his left arm, then the left hand over the right arm, shakes his hands and reaches for something to wipe them with. And to the apprentice he says:

“Watch and learn, son, or you’ll be passing wrenches for the rest of your life”.

Really, you should learn things. Expand your skill set. Who wants to be passing wrenches?

Layers of abstraction

Abstraction is good. Or should I say legitimate. Or should I say inevitable. I mean, you have to count on something. Something has to work, because you can’t build things on top of nothing.

Except it won’t work. That something you build things on top of won’t work.

What’s that? “Whining”? Yep, definitely. This here is whining.

Whining is good. Or should I say legitimate. Or should I say inevitable. Because if you aren’t allowed to whine about frigging data channels which drop chunks of data and duplicate chunks of data because some fucking hardware subcontructor couldn’t be bothered to implement arbitration for shared data access, if you aren’t allowed to whine about that…

If you aren’t allowed to whine about that, you should be allowed to whine about memory, which flips bits, and zeros bytes, and it does so once per hour for some weird sequence of accesses having nothing to do with the address where data actually changes. Fuck that, OK? Fuck DDR2. Fuck its controllers and the zillions of their configuration parameters.

A plumber climbs out of a manhole, this time without a preamble, and his arms are covered with - guess what? - excrement! A beautiful little girl in a beautiful white dress happens to pass by. The plumber seizes the opportunity and (another piece of pantomime) quickly, but firmly sweeps his hands over the girl’s white dress.

Little girl (appalled): AAAH!!

Plumber (outraged): Oh yeah? I bet you love to take a shit though.

Yep. You love to allocate objects in memory, don’t you? Megabytes of them. And then a board designer decides to wipe his filthy hands with your beautiful white huge software system. Debug that, you perverted memory-addicted individual.

Taking pride in your work

And still, I actually like my work, on a level. Why? It feels inherently cool to design stuff that becomes this bunch of tiny parts, transistors and all, switching hundreds of millions of times a second, and then to write code that manages all the flying circus.

I know people who feel the same about computer vision. People for whom it’s a personal priority to work on computer vision, where they are given images and they look for stuff in them. Who wants to be doing that? Who wants to be responsible for the solution of a problem that can’t even be precisely defined? Me, I wanna be doing hardware.

What do I actually do most of the time though? I eat hexadecimal. I sit near a debugger, and I keep hitting Page Up in a memory view window, to find the beginning of the array that overwrote this piece of data (I guess the element size from the repetitive patterns and such), and along comes a computer vision geek and he says, “damn it, man, you got out of the Matrix!”

Well, I dunno, I find it much easier to guess what buggy code did to my memory than to find out “why” an algorithm thinks this here is a person when in fact it’s a shade of a tree. Because if you look closely at the pixels, the shade kinda looks like a person, but of course we could reject it based on its motion, but of course that would mean we’d approve these reflections over here based on their motion, but, but, but…

What my bogus example is saying is that you have lots and lots of cues but each can work both for you and against you, and now how do you weigh all that, without even a formal spec? I’d rather eat hexadecimal, thank you very much.

And we look at each other, and sincerely think that our jobs are pretty nifty, but the other guy’s job is awful and how can he be doing it. And I suspect that if one looks at this from aside, one might wonder where the actual fun is, because there is actual fun in here, or so all the participants testify. And I think I know the answer.

An airplane lands, and passengers come out. One of them notices a guy underneath the airplane. As you’d guess, the guy is a plumber. The plumber touches some lock, and immediately gets covered by excrement streaming from an opening at the bottom of the plane.

The pantomime cleanup routine follows, and then comes the turn of the dialog.

Passenger (appalled): What on Earth makes you keep this job?

Plumber (proudly): Hey, I’m in the aerospace business!

The aerospace effect happens to different people with different things. With some, it’s “Hey, I’m making real hardware!” With others, it’s “Hey, I’m finding real objects in real images!” It’s a good thing people are different, because so are the currents of excrement, and someone ought to swim in each. We can’t all be passing wrenches.

I love globals, or Google Core Dump

The entire discussion only applies to unsafe languages, the ones that dump core. By which I mean, C. Or C++, if you’re really out of luck.

If it can dump core, it will dump core, unless it decides to silently corrupt its data instead. Trust my experience of working in a multi-processor, multi-threaded, multi-programmer, multi-nightmare environment. Some C++ FQA Lite readers claimed that the fact that I deal with lots of crashes in C++ indicates that I’m a crappy programmer surrounded by similar people. Luckily, you don’t really need to trust my experience, because you can trust Google’s. Do this:

  1. Find a Google office near you.
  2. Visit a Google toilet.
  3. You’ll find a page about software testing, with the subtitle “Debugging sucks. Testing rocks.” Read it.
  4. Recover from the trauma.
  5. Realize that the chances of you being better at eliminating bugs than Google are low.
  6. Read about the AdWords multi-threaded billing server nightmare.
  7. The server was written in C++. The bug couldn’t happen in a safe language. Meditate on it.
  8. Consider yourself enlightened.

This isn’t the reason why this post has “Google core dump” in its title, but hopefully it’s a reason for us to agree that your C/C++ program will crash, too.

I love globals

What happens when we face a core dump? Well, we need the same things you’d expect to look for in any investigation: names and addresses. Names of objects looking at which may explain what happened, their addresses to actually look at them, and type information to sensibly display them.

In C and C++, we have 3 kinds of addresses: stack, heap and global. Let’s see who lives there.

Except the stack is overwritten, because it can be. Don’t count on being able to see the function calls leading to the point of crash, nor the parameters and local variables of those functions. In fact, don’t even count on being able to see the point of crash itself: the program counter, the link register, the frame pointer, all that stuff can contain garbage.

And the heap is overwritten, too, nearly as badly. The typical data structure used by C/C++ allocators (for example, dlmalloc) is a kind of linked list, where each memory block is prefixed with its size so you can jump to the next one. Overwrite one of these size values and you will have lost the boundaries of the chunks above that address. That’s a loss of 50% of the heap objects on average, assuming uniform distribution of memory overwriting bugs across the address space of the heap.

So don’t count on the stack or the heap. Your only hope is that someone has ignored the Best Practices and the finger-pointing by the more proficient colleagues, and allocated a global object. Possibly under the clever disguise of a “Singleton”. Not a bad thing after all, that moronic “design pattern”, because it ultimately allowed to counter cargo cult programmers’ accusations of “globals are evil” with equally powerful cargo cult argument of “it’s a design pattern”. So people could allocate globals again.

Which is good, because a global always has an accurate name-to-address mapping, no matter what atrocity was committed by the bulk of unsafe code running on the loose. Can’t overwrite a symbol table. And it has accurate type information, too. As opposed to objects you find through void*, or a base class pointer where the base class lacks virtual functions or the object vptr was overwritten, etc.

Which is why I frequently start debugging by firing an object view window on a global, or running debugger macros which read globals, etc. Of course you can fuck up a global variable to make debugging unpleasant. For example, if the variable is “static” in the C sense, you need to open the right file or function to display it, and you need the debugger front-end to understand the context, which will be especially challenging if it’s a static variable in a template function (one of the best things in C++ is how neatly its new features interact with C’s old ones).

Or you can stuff the global into a class or a namespace. I was never able to display globals by their qualified C++ name in, say, gdb 5. But no matter; nm <program> | grep <global> followed by p *(TypeOfGlobal*)addr always does the trick, and no attempts at obfuscating the symbol table will stop it. I still say make it a real, unashamed global to make debugging easier. If you’re lucky, you’ll get to piss off a couple of cargo cult followers as a nice side-effect.

Google Core Dump

A core dump is a web. Its sites are objects. It’s hyperlinks are pointers. It’s PageRank is a TypeRank: what’s the type of this object according to the votes of the pointers stored in other objects? The spamdexing is done by pointer-like bit patterns stored in unused memory slots. The global variables are the major sites with high availability you can use as roots for the crawling.

What utilities would we like to have for this web? The usual stuff.

  • Browsers. Debugger object view window is the Firefox, and the memory view window is the Lynx. The core dump Lynx usually sucks in that it doesn’t make it easy to follow pointers - can’t click on a word and have the browser follow the pointer (by jumping to the memory pointed by it). No back button, either. Oh well.
  • DNS. The ability to translate variable names to raw addresses. Works very reliably for globals and passably otherwise. Works reliably for all objects in safe languages.
  • Reverse DNS. Given an address, tell me the object name. Problematic for dynamically allocated objects, although you could list the names of pointer variables leading to it (Google bombing). Works reliably for global functions and variables. For some reason, the standard addr2line program only supports functions though. Which is why I have an addr2sym program. It so happened that I have several of them, in fact. You can download one here. “Reverse DNS” is particularly useful when you find pointers somewhere in registers or memory and wonder what they could point to. In safe languages, you don’t have that problem because everything is typed and so you can simply display the pointed object.
  • Google Core Dump, similar to Google Desktop or Google for the WWW. Crawl a core dump, figure out the object boundaries and types by parsing the heap linked list and the stack and looking at pointers’ “votes”, create an index, and allow me to query that index. Lots of work, that, some of it heuristical. And in order to get type information in C or C++, you’ll have to either parse the source code (good luck doing it with C++), or parse the non-portable debug information format. But it’s doable; in fact, we have it, for our particular target/debugger/allocator combo. Of course it has its glitches. Quirky and obscure enough to make open sourcing it not worth the trouble.

I really wish there was a reasonably portable and reliable Google Core Dump kind of thing. But it doesn’t look like that many people care about debugging crashes at all. Most core dumps at customer sites seem to go to /dev/null, and those that can’t be easily deciphered are apparently given up on until the bug manifests itself in some other way or its cause is guessed by someone.

Am I coming from a particularly weird niche where the code size is large enough and the development rapid enough to make crashes almost unavoidable, but crashes in the final product version are almost intolerable? Or do most good projects allocate everything on the stack and the heap, so with those smashed they’re doomed no matter what? Or is the problem simply stinky enough to make it unattractive for a hobby project while lacking revenue potential to make a good commercial project?

Would you like this sort of thing? If you would, drop me a line. In the meanwhile, I satisfy my wish for a Google Core Dump with my perfect implementation for an embedded co-processor, the one I’ve poked at with Tcl commands. With 128K of memory, no dynamic allocation, and local variables effectively implemented as globals, perfect decoding is easy. I’m telling ya, globals rule.

As to my “reverse DNS” implementation:

  • I could make it more portable by parsing the output of nm --print-size. But just running nm on a 20M symbol table takes about 2 seconds. I want instantaneous output, ’cause I’m very impatient when I debug.
  • Alternatively, I could make it more portable by using a library such as bfd. But that would drag in a library such as bfd, and I had trouble with what looked like library/compiler version mismatches with bfd, whereas my ELF parsing code never had any trouble. Also, an implementation parsing ELF is more interesting as sample code because you get to see how easy to parse these formats are. So it’s elfaddr2sym, not addr2sym. (It’s really 32-bit-ELF-with-host-endianness-addr2sym, because I’m lazy and it covers all my targets.)
  • There’s a ton of addr2sym code out there, and maybe a good addr2sym program. I just didn’t find it. I have an acknowledged weakness in the wheel reinventing department.
  • Of course I don’t demangle the ugly C++ names; piping to c++filt does.
  • The program is in D, because of the “instantaneous” bit, and because D is one of the best choices available today if you care about both speed and brevity. Look at this: lowerBound!("a.st_value <= b")(ssyms, addr) does a binary search for addr in the sorted ssyms array. As brief as it gets out of the box with any language and standard library, isn’t it? The string is compiled statically into the instantiation of the lowerBound template; a & b are the arguments of the anonymous function represented by the string. Readable. Short. Fast. Easy to use - garbage-collected array outputs in functions like filter(), error messages to the point - that’s why a decent grammar is a good thing even if you aren’t the compiler writer. Looks a lot like C++, braces, static typing, everything. Thus easy to pimp in a 3GL environment, in particular, a C++ environment. You can download the Digital Mars D compiler for Linux, or wait for C++0x to solve 15% of the problems with <algorithm> by introducing worse problems.

By the way, the std.algorithm module, the one with the sort, filter, lowerBound and similar functions, is by Andrei Alexandrescu, of Modern C++ Design fame. How is it possible that his stuff in D is so yummy while his implementation of similar things in C++ is equally icky? Because C++ is to D what proper fixation is to anaesthesia. There, I bet you saw it coming.

What does “global” mean?

For the sake of completeness, I’d like to bore you with a discussion of the various aspects of globalhood, in the vanishingly small hope of this being useful in a battle against a cargo cult follower authoring a coding convention or such. In C++, “global” can mean at least 6 things:

  • Number of instances per process. A “global” is everything that’s instantiated once.
  • Life cycle. A “global” is constructed before main and destroyed after main. A static variable inside a function is not “global” in this sense.
  • “Scope” in the “namespace” sense (as opposed to the life cycle sense). We have C-style file scope, class scope, function scope, and “the true global scope”. And we have namespaces.
  • Storage. A “global” is assigned a link time address and stored there. In a singleton implementation calling new and assigning its output to a global pointer, the pointer is “global” in this sense but the object is not.
  • Access control. If it’s in a class scope, it may be private or protected, which makes it less of a global in this fifth sense.
  • Responsibility. A global can be accessible from everywhere but only actually accessed from a couple of places. For example, you can allocate a large object instantiating lots of members near your main function and then call object methods which aren’t aware that the stuff is allocated globally.

So when I share my love of globals with you, the question is which aspect of globality I mean. What I mean is this:

  1. I like global storage - link-time addresses - for everything which can be handled that way. A global pointer is better than nothing, but it can be overwritten and you will have lost the object; better allocate the entire thing globally.
  2. I like global scope, no classes, namespaces and access control keywords attached, to make symbol table look-up easier, thus making use of the global allocation.
  3. I like global life cycle - no Meyers’ singletons and lazy initialization. In fact, I like trivial constructors/destructors, leaving the actual work to init/close functions called by main(). This way, you can actually control the order in which things are done and know what the dependencies are. With Meyers’ singletons, the order of destruction is uncontrollable (it’s the reverse order of initialization, which doesn’t necessarily work). Solutions were proposed to this problem, so dreadful that I’m not going to discuss them. Just grow up, design the damned init/close sequence and be in control again. Why do people think that all major operations should be explicit except for initialization which should happen automagically when you least expect it?
  4. “Globals” in the sense of “touched by every piece of code” is the trademark style of a filthy swine. There are plenty of good reasons to use “globals”; none of them has anything to do with “globals” as in “variables nobody/everybody is responsible for”.
  5. I think that everything that’s instantiated once per process is a “global”, and when you wrap it with scope, access control, and design patterns, you shouldn’t stop calling it a global (and instead insist on “singleton”, “static class member”, etc.). It’s still a global, and its wrapping should be evaluated by its practical virtues. Currently, I see no point in wrapping globals in anything - plain old global variables are the thing best supported by all software tools I know.

I think this can be used as “rationale” in a coding guideline, maybe in the part allowing the use of globals as an “exception”. But I keep my hopes low.