Entries Tagged 'software' ↓

The C++ Sucks Series: petrifying functions

Your IT department intends to upgrade your OS and gives a developer a new image to play with. The developer is generally satisfied, except there’s this one program that mysteriously dumps core. Someone thoughtfully blames differences in system libraries.

Alternative prelude: you have this program and you’re working on a new version. Being generally satisfied with the updates, you send the code overseas. They build it and it mysteriously dumps core. Someone thoughtfully blames differences in the compiler version.

Whatever the prelude, you open the core dump with `gdb app core` and gdb says:

#0  0x080484c9 in main (argc=Cannot access memory at address 0xbf3e7a8c) at main.cpp:4
4    int main(int argc, char** argv)
(gdb)

Check out the garbage near “argc=” - if it ain’t printing garbage, it ain’t a C++ debugger. Anyway, it looks like the program didn’t even enter main. An alert C++ hater will immediately suspect that the flying circus happening in C++ before main could be at fault, but in this case it isn’t. In fact, a program can be similarly petrified by the perspective of entering any function, not necessarily main. It’s main where it crashes in our example because the example is small; here’s the source code:

#include <stdio.h>
#include "app.h"

int main(int argc, char** argv)
{
  if(argc != 2) {
    printf("please specify a profile\n");
    return 1;
  }
  const char* profile = argv[1];
  Application app(profile);
  app.mainLoop();
}

On your machine, you run the program without any arguments and sure enough, it says “please specify a profile”; on this other machine, it just dumps core. Hmmm.

Now, I won’t argue that C++ isn’t a high-level object-oriented programming language since every book on the subject is careful to point out the opposite. Instead I’ll argue that you can’t get a first-rate user experience with this high-level object-oriented programming language if you don’t also know assembly. And with the first-rate experience being the living hell that it is, few would willingly opt for a second-rate option.

For example, nothing at the source code level can explain how a program is so shocked by the necessity of running main that it dumps a core in its pants. On the other hand, here’s what we get at the assembly level:

(gdb) p $pc
$1 = (void (*)(void)) 0x80484c9 <main+20>
(gdb) disass $pc
Dump of assembler code for function main:
0x080484b5 <main+0>:    lea    0x4(%esp),%ecx
0x080484b9 <main+4>:    and    $0xfffffff0,%esp
0x080484bc <main+7>:    pushl  -0x4(%ecx)
0x080484bf <main+10>:    push   %ebp
0x080484c0 <main+11>:    mov    %esp,%ebp
0x080484c2 <main+13>:    push   %ecx
0x080484c3 <main+14>:    sub    $0xa00024,%esp
0x080484c9 <main+20>:    mov    %ecx,-0xa0001c(%ebp)
# we don't care about code past $pc -
# a screenful of assembly elided

What this says is that the offending instruction is at the address main+20. As you’d expect with a Segmentation fault or a Bus error core dump, this points to an instruction accessing memory, specifically, the stack.

BTW I don’t realy know the x86 assembly, but I can still read it thusly: “mov” can’t just mean the tame RISC “move between registers” thing because then we wouldn’t crash, so one operand must spell a memory address. Without remembering the source/destination order of the GNU assembler (which AFAIK is the opposite of the usual), I can tell that it’s the second operand that is the memory operand because there’s an integer constant which must mean an offset or something, and why would you need a constant to specify a register operand. Furthermore, I happen to remember that %ebp is the frame pointer register which means that it points into the stack, however I could figure it out from a previous instruction at main+11, which moves %esp [ought to be the stack pointer] to %ebp (or vice versa, as you could think without knowing the GNU operand ordering - but it would still mean that %ebp points into the stack.)

Which goes to show that you can read assembly while operating from a knowledge base that is not very dense, a way of saying “without really knowing what you’re doing” - try that with C++ library code; but I digress. Now, why would we fail to access the stack? Could it have to do with the fact that we apparenty access it with the offset -0xa0001c, which ought to be unusually large? Let’s have a look at the local variables, hoping that we can figure out the size of the stack main needs from their sizes. (Of course if the function used a Matrix class of the sort where the matrix is kept by value right there in a flat member array, looking at the named local variables mentioned in the program wouldn’t be enough since the temporaries returned by overloaded operators would also have to be taken into account; luckily this isn’t the case.)

(gdb) info locals
# if it ain't printing garbage, it ain't a C++ debugger:
profile = 0xb7fd9870 "U\211?WVS??\207"
app = Cannot access memory at address 0xbf3e7a98

We got two local variables; at least one must be huge then. (It can be worse in real life, main functions being perhaps the worst offenders, as many people are too arrogant to start with an Application class. Instead they have an InputParser and an OutputProducer and a Processor, which they proudly use in a neat 5-line main function - why wrap that in a class, 2 files in C++-land? Then they add an InputValidator, an OutputFormatConfigurator and a ProfileLoader, then less sophisticated people gradually add 20 to 100 locals for doing things right there in main, and then nobody wants to refactor the mess because of all the local variables you’d have to pass around; whereas an Application class with two hundred members, while disgusting, at least makes helper functions easy. But I digress again.)

(gdb) p sizeof profile
$2 = 4
(gdb) p sizeof app
$3 = 10485768

“10485768″. The trouble with C++ debuggers is that they routinely print so much garbage due to memory corruption, debug information inadequacy and plain stupidity that their users are accustomed to automatically ignore most of their output without giving it much thought. In particular, large numbers with no apparent regularity in their digits are to a C++ programmer what “viagra” is to a spam filter: a sure clue that something was overwritten somewhere and the number shouldn’t be trusted (I rarely do pair programming but I do lots of pair debugging and people explicitly shared this spam filtering heuristic with me).

However, in this case overwriting is unlikely since a sizeof is a compile time constant stored in the debug information and not in the program memory. We can see that the number will “make more sense” in hexadecimal (which is why hex is generally a good thing to look at before ignoring “garbage”):

(gdb) p /x sizeof app
$4 = 0xa00008

…Which is similar to our offset value, and confirms that we’ve been debugging a plain and simple stack overflow. Which would be easy to see in the case of a recursive function, or if the program crashed, say, in an attempt to access a large local array. However, in C++ it will crash near the beginning of a function long before the offending local variable is even declared, in an attempt to push the frame pointer or some such; I think I also saw it crash in naively-looking places further down the road, but I can’t reproduce it.

Now we must find out which member of the Application class is the huge one, which is lots of fun when members are plentiful and deeply nested, which, with a typical Application class, they are. Some languages have reflection given which we could traverse the member tree automatically; incidentally, most of those languages don’t dump core though. Anyway, in our case finding the problem is easy because I’ve made the example small.

(I also tried to make it ridiculous - do you tend to ridicule pedestrian code, including your own, sometimes as you type? Few do and the scarcity makes them very dear to me.)

class Application
{
 public:
  Application(const char* profile);
  void mainLoop();
 private:
  static const int MAX_BUF_SIZE = 1024;
  static const int MAX_PROF = 1024*10;
  const char* _profPath;
  char _parseBuf[MAX_BUF_SIZE][MAX_PROF];
  Profile* _profile;
};

This shows that it’s _parseBuf that’s causing the problem. This also answers the question of an alert C++ apologist regarding all of the above not being special to C++ but also relevant to C (when faced with a usability problem, C++ apologists like to ignore it and instead concentrate on assigning blame; if a problem reproduces in C, it’s not C++’s fault according to their warped value systems.) Well, while one could write an equivalent C code causing a similar problem, one is unlikely to do so because C doesn’t have a private keyword which to a first approximation does nothing but is advertised as an “encapsulation mechanism“.

In other words, an average C programmer would have a createApplication function which would malloc an Application struct and all would be well since the huge _parseBuf wouldn’t land on the stack. Of course an average C++ programmer, assuming he found someone to decipher the core dump for him as opposed to giving up on the OS upgrade or the overseas code upgrade, could also allocate the Application class dynamically, which would force him to change an unknown number of lines in the client code. Or he could change _parseBuf’s type to std::vector, which would force him to change an unknown number of lines in the implementation code, depending on the nesting of function calls from Application. Alternatively the average C++ programmer could change _parseBuf to be a reference, new it in the constructor(s) and delete it in the destructor, assuming he can find someone who explains to him how to declare references to 2D arrays.

However, suppose you don’t want to change code but instead would like to make old code run on the new machine - a perfectly legitimate desire independently of the quality of the code and its source language. The way to do it under Linux/tcsh is:

unlimit stacksize

Once this is done, the program should no longer dump core. `limit stacksize` would show you the original limit, which AFAIK will differ across Linux installations and sometimes will depend on the user (say, if you ssh to someone’s desktop, you can get a lesser default stacksize limit and won’t be able to run the wretched program). For example, on my wubi installation (Ubuntu for technophopes like myself who have a Windows machine, want a Linux, and hate the idea of fiddling with partitions), `limit stacksize` reports the value of 8M.

Which, as we’ve just seen, is tiny.

Coding standards: having more errors in code than code

I ran LINT version 9, configured to report the violations of the rules in the MISRA C++ 2008 coding standard, on a C++ source file. LINT is perhaps the most famous tool for statically checking C and C++ source code. MISRA stands for the Motor Industry Software Reliability Association, mandating adherence to its coding standards throughout the automotive industry.

The source file I tried has several KLOC worth of code, and the output of the preprocessor takes about 1M - pretty normal for C++ where a “Hello, world!” program generates 3/4M of preprocessed output. The output of LINT takes 38M. That’s 38x more errors than code.

We’re not finished parsing this output so I’m not sure which rules cause most violations and whether they can be clustered somehow to compress the 38M into something resembling comprehensible narrative in contents and size. The only thing basic attempts at parsing revealed at this point is that the distribution of the violations is roughly geometric, with the majority of the errors reporting violations of a minority of the rules.

Therefore, my only way of conveying some insight into the MISRA rules enforced by LINT is to look at a toy example. My example will be a Hello, world program - 2 LOC or 3/4M worth of code depending on your perspective. I’ll assume LINT is told to ignore standard libraries, so it will actually be closer to 2 LOC.

#include <iostream>
int main() { std::cout << "Hello, world" << std::endl; }

From this program, LINT will produce 4 error messages when configured to enforce MISRA C++ 2008:

  1. The “int” in “int main” violates an advisory rule to avoid using built-in types and instead use typedefs indicating the size and signedness of the type, such as int32_t, INT or signed32T. Many an automotive project use a mixture of 2 or 3 of these conventions, which is compliant with the MISRA guidelines and presumably results from the history of merging or integrating code bases and/or teams. (I believe that in the particular case of main, the C and C++ standards both mandate the use of int; I didn’t check if you can use a typedef to spell int but I’m certain that you can’t have main() return an int32_t on a platform where int is 16b. Anyway, it appears that LINT doesn’t bother to special-case main() - but you can do that yourself in its configuration file or right there in the source code, as you will have to do in many other cases.)
  2. The first left shift operator violates a MISRA rule disallowing the use of bitwise shift on signed types, or so it does according to LINT, which presumably checks whether the operands are of an unsigned integral type and reports an error if they are not (the other option is that it figures an output stream or a literal character array are “signed”, but I can’t see how they can be unless it’s a signature we’re talking about rather than signedness). The MISRA rule is based on the fact that the behavior of bitwise shift is implementation-defined and thus not portable. I do believe that there does not exist a 32b machine which does not use the 2’s complement representation for integers and is a target of an automotive application. A notable share of automotive applications use signed integers to represent fixed point numbers, and I believe all of them rely on the 2’s complement semantics of bitwise shifts to emulate multiplication and division.
  3. The second left shift operator is reported as violating the same rule.
  4. The two left shift operators as a whole are reported to violate the rule disallowing dependence on C operator precedence. That is, in order to correctly understand this program, a reader would have to know that (std::cout << “Hello, world!”) would be evaluated first and then its output would be shifted to the left by std::endl. MISRA strives to prevent confusion, based on a well-founded assumption that few programmers know the rules of operator precedence and evaluation order, and LINT enforces the rules defined based on these premises.

I hope this gives some insight on the general code/errors ratio.

Humans and compilers need each other: the VLIW SIMD case

The state of the art in optimizing compilers today is such that for optimizing code, you need (1) a strong optimizing compiler and (2) a strong optimizing human. My rule of thumb is that (1) alone will yield 2x to 10x slower code. This is also what a person selling a (great) compiler “giving 80% of the optimal performance with no manual intervention” once told off-record to a roomful of programmers who pressed him into a corner, elevating my rule of thumb to a nobler plane of anecdotal evidence.

Now, I claim that this situation will persist, and in this post I’ll try to close the fairly large gap between this claim and the mere acknowledgment of what the state of the art is today. The gap is particularly large for the believer in the possibility of strong AI - and while my position is a bit different, I do believe in fairly strong AI (can I say that? people keep telling that I can’t say “nearly context-free”. oh well.)

I realize that many people experienced in optimization feel that, on the contrary, there’s in fact no gap large enough to justify an attempt as boringly rigorous (for a pop tech blog) at proving what they think is obvious as will shortly follow. But I think that many language geek discussions could benefit from a stronger bound on the power of a Sufficiently Smart Compiler than can be derived from (necessarily vague) doubts on the power of AI, and in this post I’ll try to supply such a bound. I actually think a lot of (mainly domain-specific) things could be achieved by AI-ish work on compilation - closer to “identify bubble-sort and convert to quick-sort” than to traditional “analyze when variables are alive and assign them to registers” - and this is why it’s useful to have a feeling when not to go there.

So, consider chess, where the state of the art is apparently quite similar to that in optimization: a strong human player using a strong computer program will take out both a human and a computer playing alone. However, it is conceivable that a program can be developed that doesn’t need the help of a human, being able of completely simulating human thought processes or instead alternative processes which are consistently superior. Why can’t it be the same with optimizing compilers?

(Chess and optimization are similar in another respect - few care about them; I readily acknowledge the insignificance of a 10x speed-up in a continuously expanding set of circumstances, I just happen to work in an area where it does count as a checkmate.)

I’ll try to show that optimization is a fundamentally different game from chess, quite aside from the formal differences such as decidability. I’ll use optimizing for VLIW SIMD processors to show where compilers outperform humans and vice versa. I’ll be quoting a book by the inventor of VLIW called “Embedded Computing: A VLIW Approach” to support my position on the relative strength of humans and compilers in these cases. I’ll then try to show that my examples are significant outside the peculiarities of current hardware, and attempt to state the general reason why humans are indispensable in optimization.

VLIW SIMD

First, we’ll do the acronym expansion; skip it if you’ve been through it.

VLIW stands for “Very Long Instruction Word”. What it really means is that your target processor can be told to execute several instructions in parallel. For example: R0=Add R1,R2 and R3=Mul R0,R1 and R1=Shift R5,R6. For this to work, the processor ought to be able to add, multiply and shift in parallel, that is, its execution hardware must be packed into several units, each getting distinct inputs. The units can be completely symmetric (all supporting the same operations); more often, different units support different instruction sets (so, for example, only one unit in a processor can multiply, but two of them can add, etc.) A stinky thing to note about VLIW instructions is the register semantics. In the example instruction above, R0 is mentioned both as an input and as an output. When it’s mentioned as an input of Mul its old value is meant, and not the value computed by Add. This is somewhat natural since the whole point is to run Add and Mul in parallel so you don’t want Mul to wait for Add; but it’s confusing nonetheless. We’ll come back to this shortly.

SIMD stands for “Single Instruction, Multiple Data” and is known much more widely than VLIW, being available at desktop and server processor architectures like x86 and PowerPC (VLIW reigns the quieter embedded DSP domain, the most commercially significant design probably being TI’s C6000 family.) SIMD means that you have commands like R0=Add8 R1,R2, which does 8 additions given 2 inputs. The registers are thus treated as vectors of numbers - for example, uint8[16], or uint16[8], or uint32[4], assuming 16b registers. This establishes a preference for lower-precision numbers since you can pack more of them into a register and thus process more of them at a time: with uint16, you use Add8, but with uint8, you get to use the 2x faster Add16. We’ll come back to this, too.

Optimizing for VLIW targets

The basic thing at which VLIW shines is the efficient implementation of “flat” loops (where most programs spend most time); by “flat”, I mean that there are no nested if/elses or loops. The technique for implementing loops on VLIW machines is called modulo scheduling. The same technique is used on superscalar machines like modern x86 implementations (the difference from VLIWs being the instruction encoding semantics).

Since I couldn’t find a good introductory page to link to, we’ll run through a basic example of modulo scheduling right here. The idea is pretty simple, although when I first saw hardware designers doing it manually in a casual manner, I was deeply shocked (they do it for designing new hardware rather than programming existing hardware but it’s the same principle).

Suppose you want to compute a[i]=b[i]*c+d on a VLIW processor with 4 units, 2 of them capable of load/store operations, 1 with an adder and 1 with a multiplier. All units have single-cycle latency (that is, their output is available to the next instruction; real VLIW units can have larger latencies, so that several instructions will execute before the result reaches the output register.) Let’s assume that Load and Store increment the pointer, and ignore the need to test for the exit condition through the loop. Then a trivial assembly implementation of a[i]=b[i]*c+d looks like this:

LOOP:
R0=Load b++
R1=Mul R0,c
R2=Add R1,d
Store a++,R2

This takes 4 cycles per iteration, and utilizes none of the processor’s parallelism as each instruction only uses 1 of the 4 execution units. Presumably we could do better; in fact the upper bound on our performance is 1 cycle per iteration, since no unit has to be used more than once to implement a[i]=b[i]*c+d (if we had two multiplications, for example, then with only 1 multiplying unit the upper bound would be 2 cycles/iteration.)

What we’ll do now is blithely schedule all of the work to a single instruction, reaching the throughput suggested by our upper bound:

LOOP:
R0=LOAD b++ and R1=MUL R0,c and R2=ADD R1,d and STORE a++,R2

Let’s look at what this code is doing at iteration N:

  • b[N] is loaded
  • b[N-1] (loaded at the previous iteration into R0) is multiplied by c
  • b[N-2]*c (computed at the previous iteration from the old value of R0 and saved to R1) is added to d
  • b[N-3]*c+d is saved to a[N]

This shows why our naive implementation doesn’t work (it would be quite surprising if it did) - at iteration 0, b[N-1] to b[N-3] are undefined, so it makes no sense to do things depending on these values. However, starting at N=3, our (single-instruction) loop body seems to be doing its job just fine (except for storing the result to the wrong place - b ran away during the first 3 iterations). We’ll take care of the first iterations by adding a loop header - instructions which implement the first 3 iterations, only doing the stuff that makes sense in those iterations:

R0=Load b++
R0=Load b++ and R1=Mul R0,c
R0=Load b++ and R1=Mul R0,c and R2=Add R1,d
LOOP:
R0=Load b++ and R1=Mul R0,c and R2=Add R1,d and Store a++,R2

For similar reasons, we need a loop trailer - unless we don’t mind loading 3 elements past the end of a[], but I reckon you get the idea. So we’ll skip the trailer part, and move to the more interesting case - what happens when the loop body won’t fit into a single instruction. To show that, I can add more work to be done in the loop so it won’t fit into the units, or I can use a weaker imaginary target machine to do the same work which will no longer fit into the (fewer) units. The former requires more imaginary assembly code, so I chose the latter. Let’s imagine a target machine with just 2 units, 1 with Load/Store and one with Add/Mul. Then our upper bound on performance is 2 cycles per iteration. The loop body will look like this:

LOOP:
R0=Load b++ and R2=Add R1,d
R1=Mul R0,c and Store a++,R2

Compared to the single-instruction case, which was still readable (”Load and Mul and Add and Store”), this piece looks garbled. However, we can still trace its execution and find that it works correctly at iteration N (assuming we added proper header code):

  • At instruction 1 of iteration N, b[N] is loaded
  • At instruction 2 of iteration N, b[N] (loaded to R0 by instr 1 of iter N) is multiplied by c
  • At instruction 1 of iteration N, b[N-1]*c (computed in R1 by instr 2 of iter N-1) is added to d
  • At instruction 2 of iteration N, b[N-1]*c+d (computed in R2 by instr 1 of iter N) is stored to a[N]

In common VLIW terminology, the number of instructions in the loop body, known to the rest of humanity as “throughput”, is called “initiation interval”. “Modulo scheduling” is presumably so named because the instructions implementing a loop body are scheduled “modulo initiation interval”. In our second example, the operations in the sequence Load, Mul, Add, Store go to instructions 0,1,0,1 = 0%2,1%2,2%2,3%2. In our first example, everything goes to i%1=0 - which is why I needed an example with at least 2 instructions in a loop, “modulo 1″ being a poor way to illustrate “modulo”.

In practice, “modulo scheduling” grows more hairy than simply computing the initiation interval, creating a linear schedule for your program and then “wrapping it around” the initiation interval using %. For example, if for whatever reason we couldn’t issue Mul and Store at the same cycle, we could still implement the loop at the 2 cycles/iteration throughput, but we’d have to move the Mul forward in our schedule, and adjust the rest accordingly.

I’ve done this kind of thing manually for some time, and let me assure you that fun it was not. An initiation interval of 3 with 10-15 temporary variables was on the border of my mental capacity. Compilers, on the other hand, are good at this, because you can treat your input program as a uniform graph of operations and their dependencies, and a legal schedule preserving its semantics is relatively easy to define. You have a few annoyances like pointer aliasing which precludes reordering, but it’s a reasonably small and closed set of annoyances. Quoting “Embedded Computing: A VLIW Approach” (3.2.1, p. 92): “All of these problems have been solved, although some have more satisfyingly closed-form solution than others.” Which is why some people with years of experience on VLIW targets know almost nothing about modulo scheduling - a compiler does a fine job without their help.

The book goes on to say that “Using a VLIW approach without a good compiler is not recommended” - in other words, a human without a compiler will not perform very well. Based on my experience of hand-coding assembly for a VLIW, I second that. I did reach about 95% of the performance of a compiler that was developed later, but the time it took meant that many optimizations just wouldn’t fit into a practical release schedule.

Optimizing for SIMD targets

I will try to show that humans optimize well for SIMD targets and compilers don’t. I’ll quote “Embedded Computing: A VLIW Approach” more extensively in this section. A book on VLIW may not sound like the best source for insight on SIMD, however, I somewhat naturally haven’t heard of a book on SIMD stressing how compilers aren’t good at optimizing for it. But then I haven’t heard of a book stressing the opposite, either, and success papers I saw claimed at automatic vectorization was modest. Furthermore, the particular VLIW book I quote is in fact focusing on embedded DSP where SIMD is ubiquitous, and its central theme is the importance of designing processors in ways making them good targets for optimizing compilers. It sounds like a good place to look for tips on designing compilers to work well with SIMD and vice versa; and if they say they have no such tips, it’s telling.

And in fact the bottom line of the discussion on SIMD (which they call “micro-SIMD”) is fairly grim: “The ability of compilers to automatically extract micro-SIMD without hints (and in particular, without pointer alignment information) is still unproven, and manual code restructuring is still necessary to exploit micro-SIMD parallelism” (4.1.4, p. 143). This statement from 2005 is consistent with what (AFAIK) compilers can do today. No SIMD-targeted programming environment I know relieves you of the need to use intrinsics in your C code as in “a = Add8(b,c)”, where Add8 is a built-in function-looking operator translated to a SIMD instruction.

What I find fascinating though is the way they singled out pointer alignment as a particularly interesting factor necessitating “hints”. Sure, most newbies to SIMD are appalled when they find out about the need to align pointers to 16 bytes if you want to use instructions accessing 16 bytes at a time. But how much of a show-stopper can that be if we are to look at the costs and benefits more closely? Aligning pointers is easy, producing run time errors when they aren’t is easier, telling a compiler that they are can’t be hard (say, gcc has a __vector type modifier telling that), and alternatively generating two pieces of code - optimized for the aligned case and non-optimized for the misaligned case - isn’t hard, either (the book itself mentions still other option - generating non-optimized loop header and trailer for the misaligned sections of an array).

There ought to be more significant reasons for people to be uglifying their code with non-portable intrinsics, and in fact there are. The book even discusses them in the pages preceeding the conclusion - but why doesn’t it mention the more serious reasons in the conclusion? To me this is revealing of the difference between a programmer’s perspective and a compiler writer’s perspective, which is related to the difference between optimization and chess: in chess, there are rules.

For an optimizing programmer, SIMD instructions are a resource from which most benefit must be squeezed at any reasonable cost, including tweaking the behavior of the program. For an optimizing compiler, SIMD instructions are something that can be used to implement a piece of source code, in fact the preferable way to implement it - as long as its semantics are preserved. This means that a compiler obeys rules a programmer doesn’t, making winning impossible. A typical reaction of a compiler writer is to think of this as not his problem - his problem ending where program transformations preserving the semantics are exhausted. I think this is what explains the focus on things like pointer alignment (which a compiler can in fact solve with a few hints and without affecting the results of the program) at the expense of the substantive issues (which it can’t).

In the context of SIMD optimizations, the most significant example of rules obeyed by just one of the contestants has to do with precision, which the book mentions right after alignment in its detailed discussion of the problems with SIMD. “Even when we manipulate byte-sized quantities (as in the case of most pixel-based images, for example), the precision requirements of the majority of manipulation algorithms require keeping a few extra bits around (9, 12, and 16 are common choices) for the intermediate stages of an algorithm. …this forces us up to the next practical size of sub-word … reducing the potential parallelism by a factor of two up front.” They go on to say that a 32b register will end up keeping just 2 16b numbers, giving a 2x speed-up - modest considering all the cases when you won’t get even that due to other obstacles.

This argument shows the problems precision creates for the hardware implementation of SIMD. However, the precision of intermediate results isn’t as hard a problem as this presentation makes it sound, because intermediate results are typically kept in registers, not in memory. So to keep the extra bits in intermediate results, you can either use large registers for SIMD operations and not “general-purpose” 32b ones, or you can keep intermediate results in pairs of registers - as long as you have enough processing units to generate and further process these intermediate results. Both things are done by actual SIMD hardware.

However, the significant problems created by precision lie at the software side: the compiler doesn’t know how many bits it will need for intermediate results, nor when precision can be traded for performance. In C, the type of the intermediate results in the expression (a[i]*3+b[i]*c[i])>>d is int (roughly, 32b), even if a, b and c are arrays of 8b numbers, and the parenthesized expression can in fact exceed 16b. The programmer may know that b[i]*c[i] never exceeds, say, 20000 so the whole thing will fit in 16b. That C has no way of specifying precise ranges of values a variable can hold (as opposed to Lisp, of all rivals to the title of the most aggressively optimizing environment) doesn’t by itself make an argument since a way could be added, just like gcc added __vector, not to mention the option of using a different language. Specifying the ranges of b[i] and c[i] wouldn’t always suffice and we would have to further uglify the code to specify the range of the product (in case both b[i] and c[i] can be large by themselves but never together), but it could be done.

The real problem with having to specify such information to the compiler isn’t the lack of a standard way of spelling it, but that a programmer doesn’t know when to do it. If it’s me who is responsible for the low-level aspects of optimization, I’ll notice the trouble with an intermediate result requiring too many bits to represent. I will then choose whether to handle it by investigating the ranges of b[i] and c[i] and restricting them if needed, by moving the shift by d into the expression as in (a[i]*3>>d)+(b[i]*c[i]>>d) so intermediate results never exceed 16b, or in some other way. But if it’s the compiler who’s responsible, chances are that I won’t know that this problem exists at all.

There’s a trade-off between performance gains, precision losses and the effort needed to obtain more knowledge about the problem. A person can make these trade-offs because the person knows “what the program really does”, and the semantics of the source code are just a rendering of that informal spec from one possible perspective. It’s even worse than that - a person actually doesn’t know what the program really does until an attempt to optimize it, so even strong AI capable of understanding an informal spec in English wouldn’t be a substitute for a person.

A person can say, “Oh, we run out of bits here. OK, so let’s drop the precision of the coefficients.” Theoretically, and importantly for my claim, strong AI can also say that - but only if it operates as a person and not as a machine. I don’t claim that we’ll never reach a point where we have a machine powerful enough to join our team as a programmer, just that (1) we probably wouldn’t want to and (2) if we would, it wouldn’t be called a compiler, it would be called a software developer. That is, you wouldn’t press a button and expect to get object code from your source code, you’d expect a conversation: “Hey, why do you need so many bits here - it’s just a smoothing filter, do you really think anyone will notice the difference? Do you realize that this generates 4x slower code?” And then someone, perhaps another machine, would answer that yes, perhaps we should drop some of the bits, but let’s not overdo it because there are artifacts, and I know you couldn’t care less because your job ends here but those artifacts are amplified when we compute the gradient, etc.

This is how persons optimize, and while a machine could in theory act as a person, it would thereby no longer be a compiler. BTW, we have a compiler at work that actually does converse with you - it says that it will only optimize a piece of code if you specify that the minimal number of iterations executed is such and such; I think it was me who proposed to handle that case using conversation. So this discussion isn’t pure rhetoric. I really wish compilers had a -warn-about-missed-optimization-opportunities switch that would give advice of this kind; it would help in a bunch of interesting cases. I just think that in some cases, precision being one of them, the amount and complexity of interactions needed to make headway like that exceeds the threshold separating aggressive optimization from aggressive lunacy.

To be sure, there are optimization problems that could be addressed by strong AI. In the case of SIMD, the book mentions one such area - they call it “Pack, Unpack, and Mix”. “Some programs require rearranging the sub-words within a container to deal with the different sub-word layouts. From time to time, the ordering of the sub-words within a word (for example, coming from loading a word from memory) does not line up with the parallelism in the code… The only solution is to rearrange the sub-words within the containers through a set of permutation or copying operations (for example, the MIX operation in the HP PA-RISC MAX-2 extension).”

An example of this reordering problem is warping: computing a[i]=b[i*step+shift]. This is impossible to do in SIMD without a permutation instruction of the kind they mention (PowerPC’s AltiVec has vec_perm, and AFAIK x86’s SSE has nothing so you can’t warp very efficiently). However, even if an instruction is available, compilers are AFAIK unable to exploit it. I see no reason why sufficiently strong AI couldn’t manage to do such things with few hints in some interesting cases. I wouldn’t bet my money on it - I side with Mitch Kapor on the Turing Test bet, but it is conceivable like the invincible chess playing program, and unlike transformations requiring “small” changes of the semantics.

Significance

There are areas of optimization that are very significant commercially but hardly interesting in a theoretical discussion (and this here’s a distinctively theoretical discussion as is any discussion where the possibility of strong AI is supposed to be taken into account).

For example, register allocation for the x86 is exceedingly gnarly and perhaps an interesting argument could be made to defend the need for human intervention in this process in extreme cases (I wouldn’t know since I never seriously optimized for the x86). However, a general claim that register allocation makes compiler optimization hard wouldn’t follow from such an argument: on a machine with plentiful and reasonably uniform registers, it’s hard to imagine what a human can do that a compiler can’t do better, and almost everybody would agree that the single reason for not making hardware that way is a commercial one - to make an x86-compatible processor.

Now, I believe that both SIMD and VLIW instruction encodings don’t have this accidental nature, and more likely are part of the Right Way of designing high-performance processors (assuming that it makes no sense to move cost from software to hardware and call that a “performance gain”, that is, assuming that performance is measured per square millimeter of silicon). One argument of rigor worthy of a pop tech blog is that most high-end processors have converged to SIMD VLIW: they have instructions processing short vectors and they can issue multiple instructions in parallel; some do the latter in the “superscalar” way of having the hardware analyze dependencies between instructions at run time and others do it in the “actual VLIW” way of having the lack of dependencies proven and marked by the compiler, but you end up doing modulo scheduling anyway.

However, this can of course indicate uninformed consumer preference rather than actual utility (I type this on a noisy Core 2 Duo box running Firefox on top of XP, a job better handled by a cheaper, silent single-core - and I’m definitely a consumer who should have known better). So my main reasons for believing VLIW and SIMD are “right” are abstract considerations on building von Neumann machines:

  • You typically have lots of distinct execution hardware: a multiplier has little in common with a load/store unit. Up to a point, it will therefore make sense to support parallel execution of instructions on the different execution hardware. The cost of supporting it will be more I/O ports connecting the execution units with the register file - quite serious because of the multiplexers selecting the registers to read/write. However, the cost of not supporting it will be more execution hardware left unused for more time. So the optimum is unlikely to be “no parallel execution”, it’s likely “judicious parallel execution”.
  • It is cheaper to have few wide registers and wide buses between the register file and the execution units than it is to have many narrow registers and buses. That’s because the cost of the register file is proportional to the product of #registers and #buses to the execution units. It is thus significantly cheaper to have 1 unit with 4 8bx8b multipliers and 2 32b buses for the inputs then it is to have 4 units with 1 8bx8b multiplier in each and 8 8b buses for the inputs. It’s also cheaper to keep 4 bytes in 1 32b register than in 4 8b registers. Likewise, it is cheaper to have 4 multipliers in 1 processor than to have 4 full-blown processor cores, because each core would have, say, its own fetch and decode logic and instruction cache - which are in fact pure overhead. So if you have a von Neumann machine with registers and buses and instruction cache, it makes sense (up to a point) to add SIMD to make the best of that investment, and this is why commercial VLIWs have SIMD, although the VLIW theory recommends more units instead.

Since I believe that both VLIW and SIMD are essential for maximizing hardware performance, I also tend to think that optimizations needed to utilize these features are “mainstream” enough to support a broad claim about optimization in general. And the main point of my claim is that compilers can’t win in the optimization game, because part of that game is the ability to change the rules once you start losing.

Humans faced with a program they fail to optimize change the program, sometimes a little, sometimes a lot - I heard of 5×5 filters made 4×4 to run on a DSP. But even if we exclude the truly shameless cheating of the latter kind, the gentler cheating going into every serious optimization effort still requires to negotiate and to take responsibility in a way that a person - human or artificial - can, but a tool like a compiler can not.

Modulo scheduling is an example of the kinds of optimizations which in fact are best left to a compiler - the ones where rules are fixed: once the code is spelled, few can be added by further annotations by the author and hence the game can be won without much negotiations with the author; although sometimes a little interrogation can’t hurt.

The C++ Sucks Series: the quest for the entry point

Suppose you run on the x86 and you don’t like its default FPU settings. For example, you want your programs to dump core when they divide by zero or compute a NaN, having noticed that on average, these events aren’t artifacts of clever numerical algorithm design, but rather indications that somebody has been using uninitialized memory. It’s not necessarily a good idea for production code, but for debugging, you can tweak the x86 FPU thusly:

//this is a Linux header using GNU inline asm
#include <fpu_control.h>
void fpu_setup() {
unsigned short cw;
_FPU_GETCW(cw);
cw &= ~_FPU_MASK_ZM;//Divide by zero
cw &= ~_FPU_MASK_IM;//Invalid operation
_FPU_SETCW(cw);
}

So you call this function somewhere during your program’s initialization sequence, and sure enough, computations producing NaN after the call to fpu_setup result in core dumps. Then one day someone computes a NaN before the call to fpu_setup, and you get a core dump the first time you try to use the FPU after that point. Because that’s how x86 maintains its “illegal operation” flags and that’s how it uses them to signal exceptions.

The call stack you got is pretty worthless as you’re after the context that computed the NaN, not the context that got the exception because it happened to be the first one to use the FPU after the call to fpu_setup. So you move the call to fpu_setup to the beginning of main(), but help it does not. That’s because the offending computation happens before main, somewhere in the global object construction sequence. The order of execution of the global object constructors is undefined by the C++ standard. So if you kindly excuse my phrasing - where should we shove the call to fpu_setup?

If you have enough confidence in your understanding of the things going on (as opposed to entering hair-pulling mode), what you start looking for is the REAL entry point. C++ is free to suck and execute parts of your program in “undefined” (random) order, but a computer still executes instructions in a defined order, and whatever that order is, some instructions ought to come first. Since main() isn’t the real entry point in the sense that stuff happens before main, there ought to be another function which does come first.

One thing that could work is to add a global object to each C++ translation unit, and have its constructor call fpu_setup(); one of those calls ought to come before the offending global constructor - assuming that global objects defined in the same translation unit will be constructed one after another (AFAIK in practice they will, although in theory the implementation could, for example, order the constructor calls by the object name, so they wouldn’t). However, this can get gnarly for systems with non-trivial build process and/or decomposition into shared libraries. Another problem is that compilers will “optimize away” (throw away together with the side effects, actually) calls to constructors of global objects which aren’t “used” (mentioned by name). You can work around that by generating code “using” all the dummy objects from all the translation units and calling that “using” code from, say, main. Good luck with that.

The way I find much easier is to not try to solve this “portably” by working against the semantics prescribed by the C++ standard, but instead rely on the actual implementation, which usually has a defined entry point, and a bunch of functions known to be called by the entry point before main. For example, the GNU libc uses a function called __libc_start_main, which is eventually called by the code at _start (the “true” entry point containing the first executed instruction, AFAIK; I suck at GNU/Linux and only know what was enough to get by until now.) In general, running `objdump -T <program> | grep start` (which looks for symbols from shared libraries - “nm <program>” will miss those) is likely to turn up some interesting function. In these situations, some people prefer to find out from the documentation, others prefer to crawl under a table and die of depression; the grepping individuals of my sort are somewhere in between.

Now, instead of building (correctly configure-ing and make-ing) our own version of libc with __libc_start_main calling the dreaded fpu_setup, we can use $LD_PRELOAD - an env var telling the loader to load our library first. If we trick the loader into loading a shared library containing the symbol __libc_start_main, it will override libc’s function with the same name. (I’m not very good at dynamic loading, but the sad fact is that it’s totally broken, under both Windows and Unix, in the simple sense that where a static linker would give you a function redefinition error, the dynamic loader will pick a random function of the two sharing a name, or it will call one of them from some contexts and the other one from other contexts, etc. But if you ever played with dynamic loading, you already know that, so enough with that.)

Here’s a __libc_start_main function calling fpu_setup and then the actual libc’s __libc_start_main:

#include <dlfcn.h>

typedef int (*fcn)(int *(main) (int, char * *, char * *), int argc, char * * ubp_av, void (*init) (void), void (*fini) (void), void (*rtld_fini) (void), void (* stack_end));
int __libc_start_main(int *(main) (int, char * *, char * *), int argc, char * * ubp_av, void (*init) (void), void (*fini) (void), void (*rtld_fini) (void), void (* stack_end))
{
fpu_setup();
void* handle = dlopen("/lib/libc.so.6", RTLD_LAZY | RTLD_GLOBAL);
fcn start = (fcn)dlsym(handle, "__libc_start_main");
(*start)(main, argc, ubp_av, init, fini, rtld_fini, stack_end);
}

Pretty, isn’t it? Most of the characters are spent on spelling the arguments of this monstrosity - not really interesting since we simply propagate whatever args turned up by grepping/googling for “__libc_start_main” to the “real” libc’s __libc_start_main. dlopen and dlsym give us access to that real __libc_start_main, and /lib/libc.so.6 is where my Linux box keeps its libc (I found out using `ldd <program> | grep libc`).

If you save this to a fplib.c file, you can use it thusly:

gcc -o fplib.so -shared fplib.c
env LD_PRELOAD=./fplib.so <program>

And now your program should finally dump core at the point in the global construction sequence where NaN is computed.

This approach has the nice side-effect of enabling you to “instrument” unsuspecting programs without recompiling them s.t. they run with a reconfigured FPU (to have them crash if they compute NaNs, unless of course they explicitly configure the FPU themselves instead of relying on what they get from the system.) But there are niftier applications of dynamic preloading, such as valgrind on Linux and .NET on Windows (BTW, I don’t know how to trick Windows into preloading, just that you can.) What I wanted to illustrate wasn’t how great preloading is, but the extent to which C++, the language forcing you to sink that low just to execute something at the beginning of your program, SUCKS.

Barf.

Corrections - thanks to the respective commenters for these:

1. Section 3.6.2/1 of the ISO C++ standard states, that “dynamically initialized [objects] shall be initialized in the order in which their definition appears in the translation unit”. So at least you have that out of your way if you want to deal with the problem at the source code level.

2. Instead of hard-coding the path to libc.so, you can pass RTLD_NEXT to dlsym.

Consistency: how to defeat the purpose of IEEE floating point

I don’t know much about the design of IEEE floating point, except for the fact that a lot of knowledge and what they call “intellectual effort” went into it. I don’t even know the requirements, and I suspect those were pretty detailed and complex (for example, the benefits of having a separate representation for +0 and -0 seem hard to grasp unless you know about the very specific and hairy examples in the complex plane). So I don’t trust my own summary of the requirements very much. That said, here’s the summary: the basic purpose of IEEE floating point is to give you results of the highest practically possible precision at each step of your computation.

I’m not going to claim this requirement is misguided, because I don’t feel like arguing with people two orders of magnitude more competent than myself who have likely faced much tougher numerical problems than I’ve ever seen. What I will claim is that differences in numerical needs divide programmers into roughly three camps, and the highest-possible-precision approach hurts one of them really badly, and so has to be worked around in ways we’ll discuss. The camps are:

  1. The huge camp of people who do businessy accounting. Those should work with integral types to get complete, deterministic, portable control over rounding and all that. Many of the clueless people in this camp represent 1 dollar and 10 cents as the floating point number 1.1. While they are likely a major driving force behind economical growth, I still think they deserve all the trouble they bring upon themselves.
  2. The tiny camp doing high-end scientific computing. Those are the people who can really appreciate the design of IEEE floating point and use its full power. It’s great that humanity accidentally satisfied the needs of this small but really cool group, making great floating point hardware available everywhere through blind market forces. It’s like having a built-in Stradivari in each home appliance. Yes, perhaps I exaggerate; I get that a lot.
  3. The sizable camp that deals with low-end to mid-range semi-scientific computing. You know, programs that have some geometry or physics or algebra in them. 99.99% of the code snippets in that realm work great with 64b floating point, without the author having invested any thought at all into “numerical analysis”. 99% of the code works with 32b floats. When someone stumbles upon a piece of code in the 1% and actually witnesses fatal precision loss, everybody gathers to have a look as if they heard about a beautiful rainbow or a smoke suggesting a forest fire near the horizon.

The majority of people who use and actually should be using floating point are thus in camp 3. Those people don’t care about precision anywhere near camp 2, nor do they know how to make the best of IEEE floating point in the very unlikely circumstances where their naive approach will actually fail to work. What they do care about though is consistency. It’s important that things compute the same on all platforms. Perhaps more importantly for most, they should compute the same under different build settings, most notably debug and release mode, because otherwise you can’t reproduce problems.

Side note: I don’t believe in build modes; I usually debug production code in release mode. It’s not just floating point that’s inconsistent across modes - it’s code snippets with behavior undefined by the language, buggy dependence on timing, optimizer bugs, conditional compilation, etc. Many other cans of worms. But the fact is that most people have trouble debugging optimized code, and nobody likes it, so it’s nice to have the option to debug in debug mode, and to do that, you need things to reproduce there.

Also, comparing results of different build modes is one way to find worms from those other cans, like undefined behavior and optimizer bugs. Also, many changes you make are optimizations or refaptorings and you can check their sanity by making sure they didn’t change the results of the previous version. As we’ll see, IEEE FP won’t give you even that, regardless of platforms and build modes. The bottom line is that if you’re in camp 3, you want consistency, while the “only” things you can expect from IEEE FP is precision and speed. Sure, “only” should be put in quotes because it’s a lot to get, it’s just a real pity that fulfilling the smaller and more popular wish for consistency is somewhere between hard and impossible.

Some numerical analysts seem annoyed by the camp 3 whiners. To them I say: look, if IEEE FP wasn’t the huge success that it is in the precision and speed departments, you wouldn’t be hearing from us because we’d be busy struggling with those problems. What we’re saying is the exact opposite of “IEEE FP sucks”. It’s “IEEE FP is so damn precise and fast that I’m happy with ALL of its many answers - the one in optimized x86 build, the one in debug PowerPC build, the one before I added a couple of local variables to that function and the one I got after that change. I just wish I consistently got ONE of these answers, any of them, but the same one.” I think it’s more flattering than insulting.

I’ve accumulated quite some experience in defeating the purpose of IEEE floating point and getting consistency at the (tiny, IMO) cost of precision and speed. I want to share this knowledge with humanity, with the hope of getting rewarded in the comments. The reward I’m after is a refutation of my current theory that you can only eliminate 95%-99% of the pain automatically and have to solve the rest manually each time it raises its ugly head.

The pain breakdown

I know three main sources of floating point inconsistency pain:

  1. Algebraic compiler optimizations
  2. “Complex” instructions like multiply-accumulate or sine
  3. x86-specific pain not available on any other platform; not that ~100% of non-embedded devices is a small market share for a pain.

The good news is that most pain comes from item 3 which can be more or less solved automatically. For the purpose of decision making (”should we invest energy into FP consistency or is it futile?”), I’d say that it’s not futile and if you can cite actual benefits you’d get from consistency, than it’s worth the (continuous) effort.

Disclaimer: I only discuss problems I know and to the extent of my understanding. I have no solid evidence that this understanding is complete. Perhaps the pain breakdown list should have item 4, and perhaps items 1 to 3 have more meat than I think. And while I tried to get the legal stuff right - which behavior conforms to IEEE 754, which conforms to C99, and which conforms to nothing but is still out there - I’m generally a weak tech lawyer and can be wrong. I can only give you the “worked on my 4 families of machines” kind of warranty.

Algebraic compiler optimizations

Compilers, or more specifically buggy optimization passes, assume that floating point numbers can be treated as a field - you know, associativity, distributivity, the works. This means that a+b+c can be computed in either the order implied by (a+b)+c or the one implied by a+(b+c). Adding actual parentheses in source code doesn’t help you one bit. The compiler assumes associativity and may implement the computation in the order implied by regrouping your parentheses. Since each floating point operation loses precision on some of the possible inputs, the code generated by different optimizers or under different optimization settings may produce different results.

This could be extremely intimidating because you can’t trust any FP expression with more than 2 inputs. However, I think that programming languages in general don’t allow optimizers to make these assumptions, and in particular, the C standard doesn’t (C99 §5.1.2.3 #13, didn’t read it in the document but saw it cited in two sources). So this sort of optimization is actually a bug that will likely be fixed if reported, and at any given time, the number of these bugs in a particular compiler should be small.

I only recall one recurring algebraic optimization problem. Specifically, a*(b+1) tended to be compiled to a*b+a in release mode. The reason is that floating-point literal values like 1 are expensive; 1 becomes a hairy hexadecimal value that has to be loaded from a constant address at run time. So the optimizer was probably happy to optimize a literal away. I was always able to solve this problem by changing the spelling in the source code to a*b+a - the optimizer simply had less work to do, while the debug build saw no reason to make me miserable by undoing my regrouping back into a*(b+1).

This implies a general way of solving this sort of problem: find what the optimizer does by looking at the generated assembly, and do it yourself in the source code. This almost certainly guarantees that debug and release will work the same. With different compilers and platforms, the guarantee is less certain. The second optimizer may think that the optimization you copied from the first optimizer into your source code is brain-dead, and undo it and do a different optimization. However, that means you target two radically different optimizers, both of which are buggy and can’t be fixed in the near future; how unlucky can you get?

The bottom line is that you rarely have to deal with this problem, and when it can’t be solved with a bug report, you can look at the assembly and do the optimization in the source code yourself. If that fails because you have to use two very different and buggy compilers, use the shotgun described in the next item.

“Complex” instructions

Your target hardware can have instructions computing “non-trivial” expressions beyond a*b or a+b, such as a+=b*c or sin(x). The precision of the intermediate result b*c in a+=b*c may be higher than the size of an FP register would allow, had that result been actually stored in a register. IEEE and the C standard think it’s great, because the single instruction generated from a+=b*c is both faster and more precise than the 2 instructions implementing it as d=b*c, a=a+d. Camp 3 people like myself don’t think it’s so great, because it happens in some build modes but not others, and across platforms the availability of these instruction varies, as does their precision.

AFAIK the “contraction” of a+=b*c is permitted by both the IEEE FP standard (which defines FP + and *) and the C standard (which defines FP types that can map to standards other than IEEE). On the other hand, sin(x), which also gets implemented in hardware these days, isn’t addressed by either standard - to the same effect of making the optimization perfectly legitimate. So you can’t solve this by reporting a bug the way you could with algebraic optimizations. The other way in which this is tougher is that tweaking the code according to the optimizer’s wishes doesn’t help much. AFAIK what can help is one of these two things:

  1. Ask the compiler to not generate these instructions. Sometimes there’s an exact compiler option for that, like gcc’s platform-specific -mno-fused-madd flag, or there’s (a defined and actually implemented) pragma directive such as #pragma STDC FP_CONTRACT. Sometimes you don’t have any of that, but you can lie to the compiler that you’re using an older (compatible) revision of the processor architecture without the “complex” instructions. The latter is an all-or-nothing thing enabling/disabling lots of stuff, so it can degrade performance in many different ways; you have to check the cost.
  2. If compiler flags can’t help, there’s the shotgun approach I promised to discuss above, also useful for hypothetical cases of hard-to-work-around algebraic optimizations. Instead of helping the optimizer, we get in its way and make optimization impossible using separate compilation. For example, we can convert a+=b*c to a+=multiply_dammit(b,c); multiply_dammit is defined in a separate file. This makes it impossible for most optimizers to see the expression a+=b*c, and forces them to implement multiplication and addition separately. Modern compilers support link-time inlining and then they do optimize the result as a whole. But you can disable that option, and as a side effect speed up linkage a great deal; if that seriously hurts performance, your program is insane and you’re a team of scary ravioli coders.

The trouble with the shotgun approach, aside from its ugliness, is that you can’t afford to shoot at the performance-critical parts of your code that way. Let us hope that you’ll never really have to choose between FP consistency and performance, as I’ve never had to date.

x86

Intel is the birthplace of IEEE floating point, and the manufacturer of the most camp-3-painful and otherwise convoluted FP hardware. The pain comes, somewhat understandably, from a unique commitment to the IEEE FP philosophy - intermediate results should be as precise as possible; more on that in a moment. The “convoluted” part is consistent with the general insanity of the x86 instruction set. Specifically, the “old” (a.k.a “x87″) floating point unit uses a stack architecture for addressing FP operands, which is pretty much the exact opposite of the compiler writer’s dream target, but so is the rest of x86. The “new” floating point instructions in SSE don’t have these problems, at the cost of creating the aesthetic/psychiatric problem of actually having two FP ISAs in the same processor.

Now, in our context we don’t care about the FP stack thingie and all that, the only thing that matters is the consistency of precision. The “old” FP unit handles precision thusly. Precision of stuff written to memory is according to the number of bits of the variable, ’cause what else can it be. Precision of intermediate results in the “registers” (or the “FP stack” or whatever you call it) is defined according to the FPU control & status register, globally for all intermediate results in your program.

By default, it’s 80 bits. This means that when you compute a*b+c*d and a,b,c,d are 32b floats, a*b and c*d are computed in 80b precision, and then their 80b sum is converted to a 32b result in memory (if a*b+c*d is indeed written to memory and isn’t itself an “intermediate” result). Indeed, what’s “intermediate” in the sense of not being written to memory and what isn’t? That depends on:

  1. Debug/release build. If we have “float e=a*b+c*d”, and e is only used once right in the next line, the optimizer probably won’t see a point in writing it to memory. However, in a debug build there’s a good reason to allocate it in memory, because if you single-step your program and you’re already past the line that used e, you still might want to look at the value of e, so it’s good that the compiler kept a copy of it for the debugger to find.
  2. The code “near” e=a*b+c*d according to the compiler’s notion of proximity also affects its decisions. There are only so many registers, and sometimes you run out of them and have to store things in memory. This means that if you add or remove code near the line or in inline functions called near the line, the allocation of intermediate results may change.

Compilers could have an option asking them to hide this mess and give us consistent results. The problems with this are that (1) if you care about cross-platform/compiler consistency, then the availability of cross-mode consistency options in one compiler doesn’t help with the other compiler and (2) for some reason, compilers apparently don’t offer this option in a very good way. For example, MS C++ used to have a /fltconsistency switch but seems to have abandoned it in favor of an insane special-casing of the syntax float(a*b)+float(c*d) - that spelling forces consistency (although the C++ standard doesn’t assign it a special meaning not included in the plain and sane a*b+c*d).

I’d guess they changed it because of the speed penalty it implies rather than the precision penalty as they say. I haven’t heard about someone caring both about consistency and that level of precision, but I did hear that gcc’s consistency-forcing -ffloat-store flag caused notable slowdowns. And the reason it did is implied by its name - AFAIK the only way to implement x86 FP consistency at compile time is to generate code storing FP values to memory to get rid of the extra precision bits. And -ffloat-store only affects named variables, not unnamed intermediate results (annoying, isn’t it?), so /fltconsistency, assuming it actually gave you consistency of all results, should have been much slower. Anyway, the bottom line seems to be that you can’t get much help from compilers here; deal with it yourself. Even Java gave up on its initial intent of getting consistent results on the x87 FPU and retreated to a cowardly strictfp scheme.

And the thing is, you never have to deal with it outside of x86 - all floating point units I’ve come across, including the ones specified by Intel’s SSE and SSE2, simply compute 32b results from 32b inputs. People who decided to do it that way and rob us of quite some bits of precision have my deepest gratitude, because there’s absolutely no good way to work around the generosity of the original x86 FPU designers and get consistent results. Here’s what you can do:

  1. Leave the FP CSR configured to 80b precision. 32b and 64b intermediate results aren’t really 32b and 64b. The fact that it’s the default means that if you care about FP result consistency, intensive hair pulling during your first debugging sessions is an almost inevitable rite of passage.
  2. Set the FP CSR to 64b precision. If you only use 64b variables, debug==release and you’re all set. If you have 32b floats though, then intermediate 32b results aren’t really 32b. And usually you do have 32b floats.
  3. Set the FP CSR to 32b precision. debug==release, but you’re far from “all set” because now your 64b results, intermediate or otherwise, are really 32b. Not only is this a stupid waste of bits, it is not unlikely to fail, too, because 32b isn’t sufficient even for some of the problems encountered by camp 3. And of course it’s not compatible with other platforms.
  4. Set the FP CSR to 64b precision during most of the program run, and temporarily set it to 32b in the parts of the program using 32b floats. I wouldn’t go down that path; option 4 isn’t really an option. I doubt that you use both 32b and 64b variables in a very thoughtful way and manage to have a clear boundary between them. If you depend on the ability of everyone to correctly maintain the CSR, then it sucks sucks sucks.

Side note: I sure as hell don’t believe in “very special” “testing” build/running modes. For example, you could say that you have a special mode where you use option (3) and get 32b results, and use that mode to test debug==release or something. I think it’s completely self-defeating, because the point of consistency is being able to reproduce a phenomenon X that happens in a mode which is actually important, in another mode where reproducing X is actually useful. Therefore, who needs consistency across inherently useless modes? We’d be defeating the purpose of defeating the purpose of IEEE floating point.

Therefore, if you don’t have SSE, the only option is (2) - set the FP CSR to 64b and try to avoid 32b floats. On Linux, you can do it with:

#include <fpu_control.h>
fpu_control_t cw;
_FPU_GETCW(cw);
cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
_FPU_SETCW(cw);

Do it first thing in main(). If you use C++, you should do it first thing before main(), because people can use FP in constructors of global variables. This can be achieved by figuring out the compiler-specific translation unit initialization order, compiling your own C/C++ start-up library, overriding the entry point of a compiled start-up library using stuff like LD_PRELOAD, overwriting it in a statically linked program right there in the binary image, having a coding convention forcing to call FloatingPointSingleton::instance() before using FP, or shooting the people who like to do things before main(). It’s a trade-off.

The situation is really even worse because the FPU CSR setting only affects mantissa precision but not the exponent range, so you never work with “real” 64b or 32b floats there. This matters in cases of huge numbers (overflow) and tiny numbers (double rounding of subnormals). But it’s bad enough already, and camp 3 people don’t really care about the extra horror; if you want those Halloween stories, you can find them here. The good news are that today, you are quite likely to have SSE2 and very likely to have SSE on your machine. So you can automatically sanitize all the mess as follows:

  1. If you have SSE2, use it and live happily ever after. SSE2 supports both 32b and 64b operations and the intermediate results are of the size of the operands. BTW, mixed expressions like a+b where a is float and b is double don’t create consistency problems on any platform because the C standard specifies the rules for promotion precisely and portably (a will be promoted to double). The gcc way of using SSE2 for FP is -mfpmath=sse -msse2.
  2. If you only have SSE, use it for 32b floats which it does support (gcc: -mfpmath=sse -msse). 64b floats will go to the old FP unit, so you’ll have to configure it to 64b intermediate results. Everything will work, the only annoying things being (1) the retained necessity to shoot the people having fun before main and (2) the slight differences in the semantics of control flags in the old FP and the SSE FP CSR, so if you configure your own policy, floats and doubles will not behave exactly the same. Neither problem is a very big deal.

Interestingly, SSE with its support for SIMD FP commands actually can make things worse in the standard-violating-algebraic-optimizations department. Specifically, Intel’s compiler reportedly has (had?) an optimization which unrolls FP accumulation loops and reorders additions in order to utilize SIMD FP commands (gcc 4 does that, too, but only if you explicitly ask for trouble with -funsafe-math-optimizations or similar). But I wouldn’t conclude anything from it, except that automatic vectorization, which is known to work only on the simplest of code snippets, actually doesn’t work even on them.

Summary: use SSE2 or SSE, and if you can’t, configure the FP CSR to use 64b intermediates and avoid 32b floats. Even the latter solution works passably in practice, as long as everybody is aware of it.

I think I covered everything I know except for things like long double, FP exceptions, etc. - and if you need that, you’re not in camp 3; go away and hang out with your ivory tower numerical analyst friends. If you know a way to automate away more pain, I’ll be grateful for every FP inconsistency debugging afternoon your advice will save me.

Happy Halloween!

I want a struct linker

Here’s a problem I’ve seen a lot (it’s probably classified as an “Antipattern” or a “Code Smell” and as such has a name in the appropriate circles, but I wouldn’t know, so I’ll leave it nameless).

You have some kind of data structure that you pass around a lot. Soon, the most valuable thing about the structure isn’t the data it keeps, but the fact that it’s available all the way through some hairy flow of control. If you want to have your data flow through all those pipes, just add it to The Data Structure. (To antipattern classification enthusiasts: I don’t think we have a god object yet because we really want to pass our data through that flow and it’s right to have one kind of structure for that and not, say, propagating N+1 function parameters.)

Now suppose the structure holds an open set of data. For example, a spam filter could have a data structure to which various passes add various cues they extract from the message, and other passes can access those cues. We don’t want the structure to know what passes exist and what cues they extract, so that you can add a pass without changing the structure.

I don’t think there’s a good way to do it in a 3GL. In C or C++, you can:

  • Aggregate the cue structures by value (which means you have to recompile everything once you change/add/remove a member from any of them)
  • Keep pointers to the cue structures and use forward declarations to avoid recompilation (a bit slower, and you still have to recompile when you add/remove a whole cue structure)
  • Keep an array of void* or base class objects (not debugger-friendly, and requires a registration procedure to resize the arrays according to the number of passes and deal dynamically computed indexes to the cues to all who wish to access them)
  • Keep a key -> void* map (increasingly slow and debugger-unfriendly, and you need registration to compute the keys from cue names, or use the C substitute for interning - use pointers to global variables with names like &g_my_cue_key as keys)
  • Keep a string -> void* map (no registration or pseudo-interning, but really slow)

On top of JVM or .NET, you have pretty much the same options, plus the option to generate the cue container structure dynamically. Each cue would define an interface and the container structure would implement all those interfaces. The debugger would display containers nicely, and the code accessing them wouldn’t depend on the container class. I’d guess nobody does that though because the class generation part is likely somewhat gnarly.

In a 4GL, you can add attributes to class objects at run time. This is similar to keeping a key->pointer map in a 3GL, except the name interning is handled by the system as it should, and you don’t confuse debuggers because you’re using a standard feature of the object system. Which solves everything except for the speed issue, which is of course dwarfed by other 4GL speed issues.

Now, I used to think of it as one of the usual speed vs convenience trade-offs, but I no longer think it is, because a struct linker could solve it.

Suppose you could have “distributed” struct/class definitions in an offset-based language; you could write “dstruct SpamCues { ViagraCue viagra; CialisCue cialis; }” in the Medication spam filter module, and “dstruct SpamCues { FallicSymbolsCue fallic; SizeDescriptionsCue size; }” in the Penis Enlargement module. The structure is thus defined by all modules linked into the application.

When someone gets a SpamCues structure and accesses cues.viagra, the compiler generates a load-from-pointer-with-offset instruction - for example, in MIPS assembly it’s spelled “lw offset(ptrreg)”. However, the offset would be left for the linker to resolve, just the way it’s done today for pointers in “move reg, globalobjectlabel” and “jump globalfunclabel”.

This way, access to “distributed” structures would be as fast as “normal” structures. And you would preserve most optimizations related to adjacent offsets. For example, if your machine supports multiple loads, so a rectangle structure with 4 int members can be loaded to 4 registers with “ldm rectptrreg,{R0-R4}” or something, it could still be done because the compiler would know that the 4 members are adjacent; the only unknown thing would be the offset of the rectangle inside the larger struct.

One issue the linker could have on some architectures is handling very large offsets that don’t fit into the instruction encoding of load-from-pointer-with-offset forms. Well, I’d live even with the dumbest solution where you always waste an instruction to increment a pointer in case the offset is too large. And then you could probably do better than that, similarly to the way “far calls” (calls to functions at addresses too far from the point of call for the offset to fit into 28 bits or whatever the branch offset encoding size is on your machine) are handled today.

The whole thing can fail in presence of dynamic loading during program run as in dlopen/LoadLibrary; if you already have objects of the structure, and your language doesn’t support relocation because of using native pointers, then the dynamically loaded module won’t be able to add members to a dstruct since it can’t update the existing objects. Well, I can live with that limitation.

If the language generates native object files, there’s the problem of maintaining compatibility with the object file format. I think this could “almost” be done, by mapping a distributed structure to a custom section .dstruct.SpamCues, and implementing members (viagra, cialis, fallic, size) as global objects in that section. Then if an equivalent of a linker script says that the base address of .dstruct.SpamCues is 0, then &viagra will resolve to the offset of the member inside the structure. The change to automatically map sections named .dstruct.* to 0 surely isn’t more complicated than the handling of stuff like .gnu.linkonce, inflicted upon us by the idiocy of C++ templates and the likes of them.

And here’s why I’ll probably never see a struct linker:

  • If the language uses a native linker, a small change must be done to that linker in order to handle encodings of load/store instructions in ways it previously didn’t (currently it only has to deal with resolving pointers, not offsets). Since it’s platform-specific, the small change is actually quite large.
  • You could compromise and avoid that change by generating less efficient code which uses the already available linker ability to resolve the “address” of the viagra object in the zero-based .dstruct.SpamCues section - the code can add that “address” (offset, really) to &cues. But that could still force changes in the compiler back-end because now it has to generate assembly code adding what looks like 2 addresses, which makes no sense today and might be unsupported if the back-end preserves type information.
  • The previous items assume that the portable “front-end” work to support something like dstruct isn’t a big deal. However, I’d guess that not enough people would benefit from it/realize they’d benefit from it to make it appear in a mainstream language and its front-ends.
  • I could roll my own compiler to a language similar to a mainstream one, with a bunch of additions like this struct linker thingie. Two problems with this. One - it’s too hard to parse all the crud in a mainstream language (even if it isn’t C++) to make it worth the trouble, unless your compiler does something really grand; a bunch of nice features probably aren’t worth it. Two - most programmers take a losing approach towards their career where they want to put mainstream languages on their resume so that losers at the other end can scan their resumes for those languages; if your code is spelled in a dialect, you’ll scare off the losers forming the backbone of our industry.

It still amazes me how what really matters isn’t what can be done, but what’s already done. It’s easier to change goddamn hardware than it is to change “software infrastructure” like languages, software tools, APIs, protocols and all kinds of that shit. I mean, here we have something that’s possible and even easy to do, and yet completely impractical.

Guess I’ll have to roll my own yet-another-distributed-reflective-registration bullshit. Oh well.

I love globals, or Google Core Dump

The entire discussion only applies to unsafe languages, the ones that dump core. By which I mean, C. Or C++, if you’re really out of luck.

If it can dump core, it will dump core, unless it decides to silently corrupt its data instead. Trust my experience of working in a multi-processor, multi-threaded, multi-programmer, multi-nightmare environment. Some C++ FQA Lite readers claimed that the fact that I deal with lots of crashes in C++ indicates that I’m a crappy programmer surrounded by similar people. Luckily, you don’t really need to trust my experience, because you can trust Google’s. Do this:

  1. Find a Google office near you.
  2. Visit a Google toilet.
  3. You’ll find a page about software testing, with the subtitle “Debugging sucks. Testing rocks.” Read it.
  4. Recover from the trauma.
  5. Realize that the chances of you being better at eliminating bugs than Google are low.
  6. Read about the AdWords multi-threaded billing server nightmare.
  7. The server was written in C++. The bug couldn’t happen in a safe language. Meditate on it.
  8. Consider yourself enlightened.

This isn’t the reason why this post has “Google core dump” in its title, but hopefully it’s a reason for us to agree that your C/C++ program will crash, too.

I love globals

What happens when we face a core dump? Well, we need the same things you’d expect to look for in any investigation: names and addresses. Names of objects looking at which may explain what happened, their addresses to actually look at them, and type information to sensibly display them.

In C and C++, we have 3 kinds of addresses: stack, heap and global. Let’s see who lives there.

Except the stack is overwritten, because it can be. Don’t count on being able to see the function calls leading to the point of crash, nor the parameters and local variables of those functions. In fact, don’t even count on being able to see the point of crash itself: the program counter, the link register, the frame pointer, all that stuff can contain garbage.

And the heap is overwritten, too, nearly as badly. The typical data structure used by C/C++ allocators (for example, dlmalloc) is a kind of linked list, where each memory block is prefixed with its size so you can jump to the next one. Overwrite one of these size values and you will have lost the boundaries of the chunks above that address. That’s a loss of 50% of the heap objects on average, assuming uniform distribution of memory overwriting bugs across the address space of the heap.

So don’t count on the stack or the heap. Your only hope is that someone has ignored the Best Practices and the finger-pointing by the more proficient colleagues, and allocated a global object. Possibly under the clever disguise of a “Singleton”. Not a bad thing after all, that moronic “design pattern”, because it ultimately allowed to counter cargo cult programmers’ accusations of “globals are evil” with equally powerful cargo cult argument of “it’s a design pattern”. So people could allocate globals again.

Which is good, because a global always has an accurate name-to-address mapping, no matter what atrocity was committed by the bulk of unsafe code running on the loose. Can’t overwrite a symbol table. And it has accurate type information, too. As opposed to objects you find through void*, or a base class pointer where the base class lacks virtual functions or the object vptr was overwritten, etc.

Which is why I frequently start debugging by firing an object view window on a global, or running debugger macros which read globals, etc. Of course you can fuck up a global variable to make debugging unpleasant. For example, if the variable is “static” in the C sense, you need to open the right file or function to display it, and you need the debugger front-end to understand the context, which will be especially challenging if it’s a static variable in a template function (one of the best things in C++ is how neatly its new features interact with C’s old ones).

Or you can stuff the global into a class or a namespace. I was never able to display globals by their qualified C++ name in, say, gdb 5. But no matter; nm <program> | grep <global> followed by p *(TypeOfGlobal*)addr always does the trick, and no attempts at obfuscating the symbol table will stop it. I still say make it a real, unashamed global to make debugging easier. If you’re lucky, you’ll get to piss off a couple of cargo cult followers as a nice side-effect.

Google Core Dump

A core dump is a web. Its sites are objects. It’s hyperlinks are pointers. It’s PageRank is a TypeRank: what’s the type of this object according to the votes of the pointers stored in other objects? The spamdexing is done by pointer-like bit patterns stored in unused memory slots. The global variables are the major sites with high availability you can use as roots for the crawling.

What utilities would we like to have for this web? The usual stuff.

  • Browsers. Debugger object view window is the Firefox, and the memory view window is the Lynx. The core dump Lynx usually sucks in that it doesn’t make it easy to follow pointers - can’t click on a word and have the browser follow the pointer (by jumping to the memory pointed by it). No back button, either. Oh well.
  • DNS. The ability to translate variable names to raw addresses. Works very reliably for globals and passably otherwise. Works reliably for all objects in safe languages.
  • Reverse DNS. Given an address, tell me the object name. Problematic for dynamically allocated objects, although you could list the names of pointer variables leading to it (Google bombing). Works reliably for global functions and variables. For some reason, the standard addr2line program only supports functions though. Which is why I have an addr2sym program. It so happened that I have several of them, in fact. You can download one here. “Reverse DNS” is particularly useful when you find pointers somewhere in registers or memory and wonder what they could point to. In safe languages, you don’t have that problem because everything is typed and so you can simply display the pointed object.
  • Google Core Dump, similar to Google Desktop or Google for the WWW. Crawl a core dump, figure out the object boundaries and types by parsing the heap linked list and the stack and looking at pointers’ “votes”, create an index, and allow me to query that index. Lots of work, that, some of it heuristical. And in order to get type information in C or C++, you’ll have to either parse the source code (good luck doing it with C++), or parse the non-portable debug information format. But it’s doable; in fact, we have it, for our particular target/debugger/allocator combo. Of course it has its glitches. Quirky and obscure enough to make open sourcing it not worth the trouble.

I really wish there was a reasonably portable and reliable Google Core Dump kind of thing. But it doesn’t look like that many people care about debugging crashes at all. Most core dumps at customer sites seem to go to /dev/null, and those that can’t be easily deciphered are apparently given up on until the bug manifests itself in some other way or its cause is guessed by someone.

Am I coming from a particularly weird niche where the code size is large enough and the development rapid enough to make crashes almost unavoidable, but crashes in the final product version are almost intolerable? Or do most good projects allocate everything on the stack and the heap, so with those smashed they’re doomed no matter what? Or is the problem simply stinky enough to make it unattractive for a hobby project while lacking revenue potential to make a good commercial project?

Would you like this sort of thing? If you would, drop me a line. In the meanwhile, I satisfy my wish for a Google Core Dump with my perfect implementation for an embedded co-processor, the one I’ve poked at with Tcl commands. With 128K of memory, no dynamic allocation, and local variables effectively implemented as globals, perfect decoding is easy. I’m telling ya, globals rule.

As to my “reverse DNS” implementation:

  • I could make it more portable by parsing the output of nm --print-size. But just running nm on a 20M symbol table takes about 2 seconds. I want instantaneous output, ’cause I’m very impatient when I debug.
  • Alternatively, I could make it more portable by using a library such as bfd. But that would drag in a library such as bfd, and I had trouble with what looked like library/compiler version mismatches with bfd, whereas my ELF parsing code never had any trouble. Also, an implementation parsing ELF is more interesting as sample code because you get to see how easy to parse these formats are. So it’s elfaddr2sym, not addr2sym. (It’s really 32-bit-ELF-with-host-endianness-addr2sym, because I’m lazy and it covers all my targets.)
  • There’s a ton of addr2sym code out there, and maybe a good addr2sym program. I just didn’t find it. I have an acknowledged weakness in the wheel reinventing department.
  • Of course I don’t demangle the ugly C++ names; piping to c++filt does.
  • The program is in D, because of the “instantaneous” bit, and because D is one of the best choices available today if you care about both speed and brevity. Look at this: lowerBound!("a.st_value <= b")(ssyms, addr) does a binary search for addr in the sorted ssyms array. As brief as it gets out of the box with any language and standard library, isn’t it? The string is compiled statically into the instantiation of the lowerBound template; a & b are the arguments of the anonymous function represented by the string. Readable. Short. Fast. Easy to use - garbage-collected array outputs in functions like filter(), error messages to the point - that’s why a decent grammar is a good thing even if you aren’t the compiler writer. Looks a lot like C++, braces, static typing, everything. Thus easy to pimp in a 3GL environment, in particular, a C++ environment. You can download the Digital Mars D compiler for Linux, or wait for C++0x to solve 15% of the problems with <algorithm> by introducing worse problems.

By the way, the std.algorithm module, the one with the sort, filter, lowerBound and similar functions, is by Andrei Alexandrescu, of Modern C++ Design fame. How is it possible that his stuff in D is so yummy while his implementation of similar things in C++ is equally icky? Because C++ is to D what proper fixation is to anaesthesia. There, I bet you saw it coming.

What does “global” mean?

For the sake of completeness, I’d like to bore you with a discussion of the various aspects of globalhood, in the vanishingly small hope of this being useful in a battle against a cargo cult follower authoring a coding convention or such. In C++, “global” can mean at least 6 things:

  • Number of instances per process. A “global” is everything that’s instantiated once.
  • Life cycle. A “global” is constructed before main and destroyed after main. A static variable inside a function is not “global” in this sense.
  • “Scope” in the “namespace” sense (as opposed to the life cycle sense). We have C-style file scope, class scope, function scope, and “the true global scope”. And we have namespaces.
  • Storage. A “global” is assigned a link time address and stored there. In a singleton implementation calling new and assigning its output to a global pointer, the pointer is “global” in this sense but the object is not.
  • Access control. If it’s in a class scope, it may be private or protected, which makes it less of a global in this fifth sense.
  • Responsibility. A global can be accessible from everywhere but only actually accessed from a couple of places. For example, you can allocate a large object instantiating lots of members near your main function and then call object methods which aren’t aware that the stuff is allocated globally.

So when I share my love of globals with you, the question is which aspect of globality I mean. What I mean is this:

  1. I like global storage - link-time addresses - for everything which can be handled that way. A global pointer is better than nothing, but it can be overwritten and you will have lost the object; better allocate the entire thing globally.
  2. I like global scope, no classes, namespaces and access control keywords attached, to make symbol table look-up easier, thus making use of the global allocation.
  3. I like global life cycle - no Meyers’ singletons and lazy initialization. In fact, I like trivial constructors/destructors, leaving the actual work to init/close functions called by main(). This way, you can actually control the order in which things are done and know what the dependencies are. With Meyers’ singletons, the order of destruction is uncontrollable (it’s the reverse order of initialization, which doesn’t necessarily work). Solutions were proposed to this problem, so dreadful that I’m not going to discuss them. Just grow up, design the damned init/close sequence and be in control again. Why do people think that all major operations should be explicit except for initialization which should happen automagically when you least expect it?
  4. “Globals” in the sense of “touched by every piece of code” is the trademark style of a filthy swine. There are plenty of good reasons to use “globals”; none of them has anything to do with “globals” as in “variables nobody/everybody is responsible for”.
  5. I think that everything that’s instantiated once per process is a “global”, and when you wrap it with scope, access control, and design patterns, you shouldn’t stop calling it a global (and instead insist on “singleton”, “static class member”, etc.). It’s still a global, and its wrapping should be evaluated by its practical virtues. Currently, I see no point in wrapping globals in anything - plain old global variables are the thing best supported by all software tools I know.

I think this can be used as “rationale” in a coding guideline, maybe in the part allowing the use of globals as an “exception”. But I keep my hopes low.

Ahem

To make an embarrassing story short:

  1. The merge scenario from the previous post doesn’t work the way I said it works in any DVCS it was tried on. I talked about the case of “merging merges” - when two people resolve the same conflict independently in different clones, and then someone pulls from both clones. More specifically, I mentioned the case where the conflict was between two similar patches, and each of the two people took a different patch when resolving the conflict. I claimed BitKeeper would then remove both patches in the final merge. Well, I’m still sure it did work that way for me once (an older version of bk?.. some specifics I didn’t notice?..) But bk 4 works differently; it seems to take the later conflict resolution (throwing away one of the patches in my scenario, but not both). Mercurial reportedly takes the earlier resolution, throwing away the other patch. Git and Bazaar reportedly require user intervention, possibly preventing damage by automerges at the cost of, well, requiring user intervention. bk and hg do manage to create a working version in my specific scenario, but of course throwing away one of the conflict resolutions isn’t always safe that way, it’s only safe in my scenario where both resolutions are basically equivalent. Anyway, while “merging merges” is specific to DVCS, no contemporary one seems to screw you nearly as badly as I described; “most vexing merge”, oh, c’mon. Also, it would be easy to guess that they all deal with this scenario differently, because this whole business of merging is heuristical, what are the chances for different heuristics to do the same thing? And in general it’s awfully lame to publish stuff first and check it later. I suck.
  2. And the overall excited mood of that article sucked, too, ’cause, like, c’mon, everybody knows that automerges can cut your fingers off, big deal, calm down. I mean, the worst merge-related bugs I dealt with came from automerges that any kind of version control system would allow. Two changes done in different files, that kind of thing. Coding in a “merge-friendly” way is something few people do, and it isn’t that easy. For example, you basically must never change semantics of definitions. If your function didn’t lock that semaphore, and now it does, then a call added in another branch, which was completely safe, can now cause a deadlock. So what are you going to do, modify the function name each time you change its “observable semantics”? Is everybody really that anal-retentive about it? I doubt that. But we all live with automerges because it’s cheaper to deal with their occasional damage than with the constant damage of manual merges, which take lots of time and are intolerably boring, thus very error prone. Which is why I prefer distributed systems and their better ability to merge long-living branches due to detailed recording of change history, even though long-living branches are extremely harmful. Harmful as they are, they will occasionally flourish, and then you need strong automerge, not manual merge, to end their evil lives. But anyway, who cares about the preferences of a person who doesn’t even bother to check his own trivially testable claims?

Blech.

DVCS and its most vexing merge

There’s this very elegant way to shoot your leg off with a DVCS. Here’s the recipe:

  1. I create a clone of the repository A and call it B. You create another clone of A and call it C.
  2. I add an if statement in B, fixing a bug. You fix the same bug in C in a very similar way.
  3. I pull your changes from C to B. There’s a conflict around the bug since we both fixed it. I take my patch, throwing away your nearly identical patch.
  4. Meanwhile, your roommate pulled both B and C into his clone, D. And he had to resolve that conflict, too. And he took your patch, and threw mine away.
  5. Now, D and B are pushed back to A. DVCS quiz: what happens?

Answer: the system has accurately recorded the manual merges, so it has enough information to make this new merge automatically. It sees your patch, and it throws it away as it should according to my manual merge. It sees my patch, and it flushes it down the toilet since that’s what your roommate said. Net result: both patches are gone, the bug is back in business. (Edit: it doesn’t work that way - bk version 4 does a different thing, and other systems reportedly do still other things. Do you still want to read the rest of this?..)

Which of course doesn’t matter, since it’s immediately discovered by the massive Automated Test Suite. For example, if it’s an OS kernel, each revision is automatically tested on all hardware configurations it should run on. And the whole process only takes 10 minutes, according to the Ten Minute Build XP Practice. No harm done, no reason to discuss it. I just thought it was a curious thing worth sharing.

Maybe it’s a well-known thing, but I don’t think it is, and if I’m right, it’s definitely lovable. For example, here’s what BitMover, maker of BitKeeper, the common ancestor of DVCSes, has to say about this:

“It’s important to note that because BitKeeper has a star topology and its possible to share data with any repository, it’s not necessarily recommended.”

What this is trying to say is that the graph of pulls shouldn’t be a generic graph and you’re better off with a tree. That is, I shouldn’t pull directly from you; we should both pull from and push to A. You and your roommate should also synchronize via A, or via A’s “child” repository, but then you shouldn’t push to A directly, only via that child, and so on. If we maintain this tree structure, the same conflict will never be resolved twice, and then we won’t get screwed when the merges are merged.

I wonder if you could detect the situations when you “merge merges”, that is, when the same conflict was resolved differently. You could then insist on human intervention and save those humans’ bacon. I’m too lazy to think this out and too stupid to effortlessly see the answer, so I’ll resort to a social heuristic like all of us uber-formal nerds do. The social heuristic is that Larry McVoy of BitMover has probably already thought about this, and found ways in which it could fail. So I’m not going to argue that BitKeeper merges are broken.

What I’m going to argue, at least for the next couple of paragraphs, is that it sucks when they tell you about their superstar topology and then explain that it’s best to avoid actually using it. Not only that, but they fail to mention a fairly frightening and, trust me, not at all unlikely scenario which could actually persuade you to follow their advice.

Because when they tell me “we have this simple model of how things work - repositories with complete local history and changes flowing between them - but you should arbitrarily restrict yourself to a subset of this model, for reasons we aren’t going to share with you, even though the general case works”, when they tell me that, my reply is “I alias rm to rm -f”. I understand how rm works, it’s fairly simple, and I don’t like to talk about it over and over again, “Are you sure?” - yes, I’m sure, thank you very much and good bye.

But the lovable part is, speaking of social heuristics, the lovable part is that BitMover said it right. Because if they mentioned that fairly frightening and not at all unlikely scenario, they’d scare people off rather than illustrate a point. On the other hand, when they say “It’s good practice to think about how the data should flow”, most people will nod and follow whatever advice they give them.

Just imagine a team of programmers engaging in the practice of thinking about how the data should flow, dammit, all on company time. Yeah, yeah, so BitKeeper earned a sarcastic comment on Proper Fixation. It’s still a small price to pay for getting your message to the majority of your users.

You see, the majority of programmers are not just “irrational” as we all are, but their reliance on reasoning doesn’t even exceed the mean of the population, which means they barely use reasoning at all, it’s pure gut feeling.

For example, I was writing a bunch of macros in a proprietary debugger extension language. A guy who came to talk to me looked over my shoulder, and I explained, “Debugger macros. Very useful, a crappy language though.” He said, “Yeah, looks like so.”

HE COULDN’T POSSIBLY KNOW THAT. I knew he couldn’t. How could he look at the code and realize that all variables were global? How could he know they were printed upon assignment, including loop counters (’cause it’s a “macro”, so it works just like assigning at the debugger console, which prints the variable)? He couldn’t know any of that. So why did he agree with the “crappy” part? Oh, I knew why.

“You mean it has dollar signs?” Silence. “You mean it prefixes variable names with the dollar sign, don’t you?” “Yeah, that.” “Well, I like the dollar signs, helps you distinguish between your macro variables and the variables of the debugged C program. Anything else?” “Well, um, yeah, it looks kinda primitive.” Low-end Ignorant Language Bigotry quiz: if “crappy” means “has dollar signs”, what does “primitive” mean? Answer: no type declarations. I’m sure it was that, although I didn’t go on and ask.

So that’s “engineers” for you. If you want to write programs or tech docs for the average engineer, keep that picture in mind. Or don’t. Aim for the minority, for people who don’t work that way, under the assumption that they are the ones that actually matter the most. I don’t know if this assumption is right, but at least it’s lofty.

Why DVCS?

For the record, I had my share of both centralized and distributed version control systems, and today I like it distributed and I wouldn’t ever want to go back, The Most Vexing Merge be damned. Why? I’ll share my reasons with you through the story of My Last CVS To BitKeeper Exodus. I think I’ll illustrate the engineers-and-reasoning-don’t-mix point as a by-product, because that’s how that story goes.

There recently was this argument about DVCS encouraging “code bombs”, a.k.a “crawling into a cave”. I haven’t heard either of these terms, so I’ve been using a term of my own - “accumulating critical mass”. The idea is to develop in your own corner, without integrating it with the main branch. You then show up with N KLOC of new stuff and kindly ask to merge it in.

Some people claimed this was particularly harmful for open source projects where there was no managerial authority to prevent this. Ha! Double ha. In an open source project, the key maintainers may say, “you see, we can’t integrate it when it’s done this way; we’re sorry, but you should have talked to us.” The changes will then be reimplemented or dropped on the floor.

Now, if you think that in a commercial environment a manager can easily decide to drop changes on the floor, together with the cost of implementing them and especially the cost of delaying the delivery of the features, if you think that, well, I wonder why you do. But perhaps a manager could insist on frequent integration? She could try, but she’d have to deal with real or imaginary cost of merges, increasing over time and getting in the way of deliveries. Of course there are perfect managers and perfect teams where it’s all dealt with appropriately, you just have to find them.

So yeah, “code bombing” is a problem, especially in commercial projects. But the idea that DVCS encourages it? Hilarity! It’s like saying that guns encourage murder. I prefer to think of guns as something that encourages fear of armed policemen, getting in the way of the natural instinct to club your neighbors to death. I mean, yeah, it’s easier to code bomb with a DVCS, but with a centralized system, people use code bombing - or clubbing? - even more, because merging is harder, the cost of merges increases more quickly and the ability to force integration is thus lower. The criminals are poorly equipped, but so is the police.

And this is exactly what happened to the last team stuck with CVS that I helped migrate to BitKeeper. Everybody had their own version made up of file snapshots taken at different times and merged with the repository version to different extents. A centralized system doesn’t record these local versions, so unless you immediately commit your merges, you are left with a version of a file which the system doesn’t know. This means that the next merges are going to be really hard, because you’ve lost your GCA, the greatest common ancestor. So instead of a 3-way merge, you’ll be doing a 2-way merge, which really sucks.

So I decided to not talk about the caves they were crawling into and the code bombs they were throwing at each other. Rather, I decided to show them how a 2-way merge couldn’t do what a 3-way merge could. I still think it’s the ultimate argument for DVCS, because DVCS is basically about accurate recording of all versions and not just the single time line of the main trunk. So the argument for it has to do with the benefits of such detailed recording.

So I gave this example where you start with a file having two lines:
aaa
bbb

And then I add a line, ccc, and you delete a line, aaa. If we have the GCA (a 3-way merge), then clearly the right result is deleting aaa and adding ccc, getting this:
bbb
ccc

But with a 2-way merge, we only have your file:
bbb

…and my file:
aaa
bbb
ccc

This can only be merged manually, because there’s no way to automatically figure out that you deleted aaa and I added ccc; for all the tool knows, you could have done nothing and I’ve added two lines, so the right merge is:
aaa
bbb
ccc

…canceling your change. So it has to be manual merge. Manual merge means dozens of boring deltas you have to inspect in each file. That’s what I call “costly”.

Of course it doesn’t matter in a right world, where people integrate frequently and always commit their merged files to the centralized repository. Except it wasn’t so in the wrong world of the CVS developers I was “helping” to upgrade to new tools (for the last time in my life, people, for the last time in my life). And I thought we could avoid the discussion of the somewhat-technical-but-largely-social reasons of the constantly increasing cost of merges, and instead we could focus on the technical benefits of the 3-way merge and accurate GCA recording.

And of course I was wrong. The discussion immediately shifted to “we don’t need merges” because everything is “modular” and there’s a single owner to each file. Of course it wasn’t, and there wasn’t. Some things were used by everybody, like the awful build scripts and the DMA code. Some modules had two owners, or were in a transition state and had 1.5 owners, and so on. There were merges all over the place.

And if there weren’t merges and merge-related problems, how come everybody worked on their own “pirate” version which was never recorded in the main trunk and was made from a colorful variety of files partially merged at different times? How come changes propagated with cp and emacs-diff and not cvs update? And why was the tech lead so passionate about moving to BitKeeper which doesn’t let you partially update a repository so you have to merge everything? And why did everybody anxiously object that necessity if there were “no problems with merges”?

The final result: the tech lead simply forced the migration to bk. Everybody on the team hated me for my connection with the idea (I wasn’t on their team but I used to be a likable satellite and now became a hateful satellite). Developers who I thought were their best eventually (and I mean eventually) told me it was actually a good thing. So it wasn’t a bad closure. And still, I decided that I’m not going to “help” anybody “deploy” any kind of “tool” in this lifetime again, roughly speaking. Too much emotions for this programmer.

And this was supposed to show why I like DVCS, at least in the imperfect world where long-living branches occasionally happen, and the kind of reasoning I think is interesting in this context, and the kind of reasoning other people I came across found interesting. So there were are.

P.S. Why “most vexing”?

I thought I saw that “C++’s most vexing parse” from Scott Meyers’ Effective STL has its own Wikipedia entry, but apparently it doesn’t. It’s basically a variation on the theme of C++’s declaration/definition ambiguity, and I liked the term, especially the “most” part where parses are unambiguously sorted along the vexing dimension. So I figured “X’s most vexing Y” is a good template.

I’d like to use this opportunity to say that I skimmed though Effective C++, 3rd Edition, and… Where do I start? There’s an advice to create date objects with “Date d(Day(31), Month::april(), Year(2000))” or something. That is, create special types for the constructor arguments. Well, it doesn’t check that April comes without the 31th day, does it? The Date constructor could test for it though. Well, why not test for April 41st in the Date constructor, too, and, ya know, spare the users some keystrokes, if you see what I mean? The code is verbose. C++ compiler error messages are verbose. VERBOSITY EVERYWHERE! Help!

This raises the question to the author, whether he ever worked with a system where every piece of data comes covered with the toxic waste of overzealous static typing. But this borders on an ad hominem attack. And seriously, that sort of thing is to be avoided, at least until somebody proposes to have named constants for days or years and not just months.

So instead of the personal attack, I’ll ask Software Development Consultants, all of them, to kindly change the phrasing “it’s best to do so and so” to “I like so and so” or something. Because we have this huge crappy-dollar-sign crowd, and they copy style from each other like crazy, and their ultimate source of style is books by Software Development Consultants, and whenever they see a “best practice”, their common sense is turned off and they add the technique to the bag of tricks. So Consultants have a great power in this world. It doesn’t make the common sense shut-off feature their fault, but power they do have.

And with great power comes great responsibility, profanity deleted. I mean, you’re obviously giving advice neither you nor others have tested for very long, out of generic principles, profanity deleted. Like “prefer algorithms such as for_each to loops”, an advice issued before fully compliant implementations of STL were even widely available, profanity deleted. Quite a piece of advice, that, profanity deleted. Couldn’t you at least phrase your advices in a little less self-assured way, fucking profanity deleted?

For example, Meyers has finally lowered the bridge and let the enemy of template metaprogramming occupy a notable share of pages in an Effective C++ edition. I still remember his promise to “never write about templates”, in the preface to Modern C++ Design, I think. And now the promise is broken. Hordes of clueless weenies are rushing into the minefield of template metaprogramming as we speak, since it’s now officially Mainstream C++. Can you imagine the consequences? I can’t. It’s too awful. I think I’ll go to sleep now.

OO C is passable

My problem with C++ bashing is that I’m overqualified. Ever since I’ve published C++ FQA Lite, I’ve tried to stay out of “C++ sucks” discussions. I’ve said everything before and I don’t want to say it again. And then much of the C++ arcana is finally fading from my memory; good, no need to refresh it, thank you.

I don’t always quit those discussions though. How should I do that? If I were someone else, I could send a link to the C++ FQA to end the discussion (”if you really care, check out yada yada”). But I can’t use this link myself, because “OMG, did you actually write a whole site about this?!”

So the last time I didn’t quit this discussion, I said: “You know, at some point you actually start to prefer plain C”. The seasoned C++ lover replied: “You mean, the C with classes style, with no tricky stuff? Sure, don’t we all end up writing most of the code like that?” No, said I, I meant plain C. No plus signs attached. File names ending with .c. C without classes. C.

“Now that is fanaticism,” said the guy. “What could be the point of that? You know what? You may like C++ or you may dislike it, but face it: C just isn’t good enough”.

Fanaticism?! Now that is an insult. Yes, I recently decided to write a bunch of code in plain C inside a C++ code base. I did this after maintaining the previous, C++ implementation of that stuff for 4 years. For at least 3 of those years, I didn’t do any significant rewriting in that code, because it could interfere with schedules and it would complicate merges. Although some pieces were hopelessly awful. Sloppy text parsing interleaved with interrupt handling (I exaggerate, but only very slightly). But it worked, so it was hardly urgent to fix it.

And then it had to be ported to a new platform. And I obsessed over the rewrite-or-extend question for a lot of time. There was a critical mass of incompatible new stuff, so I chose “rewrite”. But. I decided to write the new version in C++. Why? Because it’s a C++ code base, and people are used to C++ and its style and idioms, however brain-damaged. “So you, YOU will do it in C++ after all?! You’re a wuss,” said my manager, an old-time C++ hater. Time for the rhetorical question. Does this sound like the story of a fanatic?

OK then, why did I decide to do it in C after all, you may ask. I’ll tell you why. I did it because everybody was sick and tired of the build time of my auto-generated C++ code.

You see, the whole thing is data-driven. The data objects describing its workload are generated at build time. Do you know any way of generating C++ code that doesn’t compile slowly as hell? I don’t. I’ve generated both “real” code (functions doing stuff) and “data definition” code (which does practically nothing except for calling object constructors). And it always compiles slowly. Sometimes turning optimization off speeds up compilation significantly, sometimes less significantly. But the best part is this: you never know exactly what’s your problem. Try to profile a C++ compiler.

It’s not that manually written C++ code is such a blast. For example, we have a ~2K LOC file defining a class template and explicitly instantiating it 4 times. It worked fine for years, until it met a particular version of the Green Hills C++ compiler. That version decided that it needs more than 1.5G of memory to compile that file. I don’t know how much more exactly, because at that point my workstation suffocated, and could only breathe again when the process ran out of swap space and died. Here’s another rhetorical question: how the fuck are you supposed to figure out what the fuck causes this problem?

What’s that? “It’s a tool problem, not a language problem?” Bzzzt, wrong answer! It’s neither a language problem nor a tool problem; it’s my problem, because I must fix it. And this is why I don’t want to deal with a language that consistently breeds tools which create such problems for me. But since I’m hardly a fanatic, and I know exactly why I do want to work on this particular C++ code base, I hold my nose and I delve right into the pile of excrements and find out that if you instantiate each template in its own file, then the process memory consumption barely crosses the 350M mark. Nice and compact, that. So, let’s use 4 files.

Nope, manually written C++ code isn’t a picnic. But auto-generated code is worse, because it relies on some set of features and uses them a lot. The number of uses per feature per file matters. 1 explicit template instantiation per file = 350M of compiler process memory. 4 instantiations = out of memory. What about “simpler” features, but hundreds of uses? The compiler will grind to a halt for sure. Um, you might say, don’t other languages have “features” which will be used hundreds of times by generated code? Yes, they do. Those features just DON’T SUCK quite as impressively. Go ahead, show me a problem with compilation speed anywhere near what C++ exhibits in another language.

Face it: C++ just isn’t good enough. “If you really care about C++ parsing complexity, check out the FQA yada yada”. I wrote “a whole site” about it, you know. Bottom line: if you generate native code, you want to define your object model such that the generated code can be C code. Assembly is pointlessly low-level and non-portable, and C++ sucks. Believe me, or die waiting for your code to compile.

So, C. My object model will be in C. Um. Bummer. It’s an OO thing, with plugins and multiple inheritance and virtual inheritance. It has to be. You have orthogonal plugins which want to derive classes from a common base - a classic diamond hierarchy. Well, I can have a couple of macros for doing MI-style pointer arithmetic, by fetching the derived-type-specific offset of each base class object at run time. No big deal. I even like it better than the C++ MI downcasting syntax - at least you know exactly what you’re doing, and you don’t need to think whether it should be dynamic_cast or static_cast or eat_flaming_death_cast to really work.

But I miss virtual functions. I really do. I sincerely think that each and every notable feature C++ adds to C makes the language worse, with the single exception of virtual functions. Here’s why not having virtual functions in C sucks:

  • You can’t quite fake them with C macros.
  • Virtual function call is a shortcut for obj->vtable->func(obj, args). The OO spelling - obj->func(args) - is of course better.
  • You’ll usually try to make the C version shorter: obj->func(args), obj->vtable->func(args), or obj->func(obj, args). Quite likely you’ll find out that you really needed to pass obj to func and/or the vtable level of indirection. Updating the code may be tedious/too late/really annoying (because of having to admit a stupid mistake). The eventual lack of call syntax uniformity will also be annoying.
  • Decent C++ debuggers automatically downcast base class object pointers to the real run time type when you inspect the object, even when C++ RTTI support is turned off at compile time. They do it by looking at the vtable pointer. Achieving this with OO C is only possible on a per-OO-faking-style, per-debugger basis, using the ugly debugger scripting facilities. Most likely, you won’t do it and choose interactive suffering each time you debug the code, having to figure out the actual type yourself and cast pointers manually.
  • With virtual functions, base class implementations are inherited automatically. With explicit vtable structures, you need to either have links to base class implementations (the slow MFC message table way), or you need to explicitly and fully fill the vtables in each derived class. Possibly using the C default of trailing zeros in aggregate initializers as in vtable_type vtable={foo_func,bar_func} /* and the third member, baz_func, is 0 - we check for zero vtable entries before calling our pseudo-virtual functions */. Run time checks for null function pointers can make initialization code smaller, but they also make function calls slower.
  • With explicit vtable initializers, you only see the position of the function in the vtable initializer and its “derived class” name (my_class_baz_func), not its “base class” name (baz_func). You are likely to have a slightly inconsistent “derived class method” naming convention, making it annoying to figure out exactly which base class function we’re overriding here.

An impressive list, isn’t it? You can see from it that I’ve really put my employer’s money where my mouth is and actually worked with OO C for a while. Aren’t C++ classes with virtual functions simply better? No, because C++ classes don’t support aggregate initialization. If you have C structures with vtable pointers, you can use the frob_type g_obj={&vtable,5,"name"} initialization style. This translates to assembly that looks like so:

g_obj:
.word vtable
.word 5
.word .str34
.str34:
.asciz "name"

This compiles and loads as fast as it gets. Now, if you choose real C++ vtables, you rule out aggregate initialization once and for all. Your initialization will instead have to be spelled as frob_type g_obj(5, "name"), and even if you have no constructor arguments, C++ will generate a constructor to set the vtable pointer.

The good news: at least the explicit reference to the vtable in our source code is gone. The bad news: with many objects, the C++ version compiles very slowly (I’ve checked with GNU and Green Hills C++). It also generates a much larger image, since it emits both global data objects and assembly code copying them into object members at run time. The latter code also costs you load time. And if you crash there, good luck figuring out the context. But as far as I’m concerned, the worst part is the build time explosion.

Yes, yes. It’s not important. And it’s FUD. And it’s a tool problem. A “good” compiler could optimize the C++ version and generate the same assembly as the C version. And it could do it quickly. Those compiler writers are just lame. I completely agree with you, sir. Just go away already.

By the way, the same trade-off happens with C++ containers, like std::vector. They’re better than {T*base;int size;} structures because you have the shortcut of operator[] (as opposed to array->base[i]). And because debuggers can gracefully display all elements of std::vector as a list of the right size. Some of the debuggers can. Sometimes. Sometimes it breaks. But when it doesn’t, it’s fun. But, again, you can’t use aggregate initialization once your structure has a std::vector in it. And C++0x won’t solve it, because its pseudo-aggregate initializers are syntactic sugar, and my problem here isn’t the syntax, it’s the build time.

And std::vector forces allocation of data on the heap (let’s not discuss custom allocator templates, ’cause I’m gonna vomit). Can’t have the base address of a std::vector point to a global variable generated specifically to hold its data.

I like things to point to globals generated to hold their data. Helps a lot when you debug, because your pointer is now firmly associated with a symbol table name. And no matter what memory-corrupting atrocity was committed by buggy code, that association will be there to help you figure things out. And heap corruption is very common in C++, because it’s a completely unsafe language. So I care a lot about debugging core dumps with corrupted memory. Which is why base, size structures get my vote.

And that’s an important point: you can live with just safety, or just control, and of course with both, but if you have neither safety nor control, then, sir, welcome to hell. Which is why OO C is worse than OO in Java or Lisp or PowerShell, but better than OO in C++.

And OO C is not all bad after all. Manually faking virtual functions has its benefits:

  • You can dynamically “change the object type” by changing the vtable pointer. I first saw this in POV-Ray, which has a vtable for circles and a vtable for ellipses. When a geometric transformation applied to a circle object transforms it to an ellipse, the vtable pointer is changed to point to the more generic and slower ellipse rendering functions. Neat. You could do this using C++-style OO by having another level of indirection, but with C, it’s marginally faster, which can matter sometimes. And the C way is much neater, which is useful to balance the frustration caused by the drawbacks of fake OO. I use this trick a lot for debug plugins in my fake OO stuff.
  • Likewise, you can overwrite individual vtable entries. This changes the type of all objects of that “class” - primarily useful for logging and other sorts of debugging.
  • Sometimes you really don’t need obj->vtable->func(obj, args) - say, obj->func(args) is good enough. And then you win.
  • You don’t have to use a structure with named members to represent a vtable. If a bunch of functions have the same prototype, you can keep them in an array. You can then iterate over them or use an index into that array as a “member function pointer”. This way, you can have a function calling a method in a bunch of objects, and the method to call will be a parameter of that function. The C++ member function pointers don’t support the iterate-over-methods trick as efficiently, and their syntax is remarkably ugly.
  • That each function has a unique non-mangled name (as opposed to A::func, B::func, etc.) has the benefit of making the symbol table clean and independent of the non-portable C++ mangling. And you no longer depend on the varying definition look-up abilities of debuggers and IDEs (I don’t like the lengthy disambiguation menus when I ask to show me the definition of “func”, and these menus show up too often).
  • If you serialize the objects, or you have programs shoveling through process snapshots and analyzing them, the memory image of OO C objects is easier to deal with because of being more portable. Basically you only need to know the target endianness, the alignment requirements of built-in types, sizeof(enum) and the implementation of bitfields, if you use that. With C++, you need to know the layouts of objects when multiple/virtual inheritance/empty base class optimization/etc. is at play; this isn’t portable across compilers. And even that knowledge is only useful if you know the members of each class; otherwise, you need to parse the layouts out of non-portable debug information databases - parsing C++ source code is not an option. With C, you can parse its files and find the struct definitions and figure out the layouts and it will be really easy and portable. Been there, done that.

Of course, you can use all those tricks in a C++ object model when needed, and use virtual functions elsewhere. And if C++ didn’t suck so much (for example, if code compiled reasonably fast and if it had a standard ABI and and and…), that would be the way to go. But since C++ just isn’t good enough, and I have to use a pure C object model, I point out the benefits of the sucky thing that is OO C to make it a little less bitter.

Overall, OO C is passable, more so than the endless rebuild cycles caused by our previous C++ object model. So you have a bunch of lame stuff. Having to use a typedef because struct A can only be referred to as struct A, not A. No default function arguments. Having to declare variables at the beginning of the scope. (BTW, I use C89. I don’t believe in C99. I’m a Luddite, you already know that, right? I don’t think C99 is as portable as C89 here in 2008).

So, yeah. I do miss some C++ features, but it’s a small itching here, a tiny scratching there, nothing serious. I can live with that as the price for not having my legs shot off by mysterious compile time explosions and other C++ goodies. OO C is good enough for me. Fanaticism? You be the judge.