Entries Tagged 'software' ↓

Working simultaneously vs waiting simultaneously

"Multiprocessing", "multi-threading", "parallelism", "concurrency" etc. etc. can give you two kinds of benefits:

  • Doing many things at once - 1000 multiplications every cycle.
  • Waiting for many things at once - wait for 1000 HTTP requests just issued.

Some systems help with one of these but not the other, so you want to know which one - and if it's the one you need.

For instance, CPython has the infamous GIL - global interpreter lock. To what extent does the GIL render CPython "useless on multiple cores"?

  • Indeed you can hardly do many things at once - not in a single pure Python process. One thread doing something takes the GIL and the other thread waits.
  • You can however wait for many things at once just fine - for example, using the multiprocessing module (pool.map), or you could spawn your own thread pool to do the same. Many Python threads can concurrently issue system calls that wait for data - reading from TCP sockets, etc. Then instead of 1000 request-wait, request-wait steps, you issue 1000 requests and wait for them all simultaneously. Could be close to a 1000x speed-up for long waits (with a 1000-thread worker pool; more on that below). Works like a charm.

So GIL is not a problem for "simultaneous waiting" for I/O. Is GIL a problem for simultaneous processing? If you ask me - no, because:

  • If you want performance, it's kinda funny to use pure Python and then mourn the fact that you can't run, on 8 cores, Python code that's 30-50x slower than C to begin with.
  • On the other hand, if you use C bindings, then the C code could use multiple threads actually running on multiple cores just fine; numpy does it if properly configured, for instance. Numpy also uses SIMD/vector instructions (SSE etc.) - another kind of "doing many things at once" that pure Python can't do regardless of the GIL.

So IMO Python doesn't have as bad a story in this department as it's reputed to have - and if it does look bad to you, you probably can't tolerate Python's slowness doing one thing at a time in the first place.

So Python - or C, for that matter - is OK for simultaneous waiting, but is it great? Probably not as great as Go or Erlang - which let you wait in parallel for millions of things. How do they do it? Cheap context management.

Context management is a big challenge of waiting for many things at once. If you wait for a million things, you need a million sets of variables keeping track of what exactly you're waiting for (has the header arrived? then I'm waiting for the query. has it arrived? then I ask the database and wait for it etc. etc.)

If those variables are thread-local variables in a million threads, then you run into one of the problems with C - and hence OS-supported threads designed to run C. The problem is that C has no idea how much stack it's gonna need (because of the halting problem, so you can't blame C); and C has no mechanism to detect that it ran out of stack space at runtime and allocate some more (because that's how its ABIs have evolved; in theory C could do this, but it doesn't.)

So the best thing a Unixy OS could do is, give C one page for the stack (say 4K), and make say the next 1-2M of the virtual address space unaccessible (with 64b pointers, address space is cheap). When C page-faults upon stack overflow, give it more physical memory - say another 4K. This method means at least 4K of allocated physical memory per thread, or 4G for a million threads - rather wasteful. (I think in practice it's usually way worse.) All regardless of us often needing a fraction of that memory for the actual state.

And that's before we got to the cost of context switching - which can be made smaller if we use setjmp/longjmp-based coroutines or something similar, but that wouldn't help much with stack space. C's lax approach to stack management - which is the way it is to shave a few cycles off the function call cost - can thus make C terribly inefficient in terms of memory footprint (speed vs space is generally a common trade-off - it's just a bad one in the specific use case of "massive waiting" in C).

So Go/Erlang don't rely on the C-ish OS threads but roll their own - based on their stack management, which doesn't require a contiguous block of addresses. And AFAIK you really can't get readable and efficient "massive waiting" code in any other way - your alternatives, apart from the readable but inefficient threads, are:

  • Manual state machine management - yuck
  • Layered state machines as in Twisted - better, but you still have callbacks looking at state variables
  • Continuation passing as in Node.js - perhaps nicer still, but still far from the smoothness of threads/processes/coroutines

The old Node.js slides say that "green threads/coroutines can improve the situation dramatically, but there is still machinery involved". I'm not sure how that machinery - the machinery in Go or Erlang - is any worse than the machinery involved in continuation passing and event loops (unless the argument is about compatibility more than efficiency - in which case machinery seems to me a surprising choice of words.)

Millions of cheap threads or whatever you call them are exciting if you wait for many events. Are they exciting if you do many things at once? No; C threads are just fine - and C is faster to begin with. You likely don't want to use threads directly - it's ugly - but you can multiplex tasks onto threads easily enough.

A "task" doesn't need to have its own context - it's just a function bound to its arguments. When a worker thread is out of work, it grabs the task out of a queue and runs it to completion. Because the machine works - rather than waits - you don't have the problems with stack management created by waiting. You only wait when there's no more work, but never in the middle of things.

So a thread pool running millions of tasks doesn't need a million threads. It can be a thread per core, maybe more if you have some waiting - say, if you wait for stuff offloaded to a GPU/DSP.

I really don't understand how Joe Armstrong could say Erlang is faster than C on multiple cores, or things to that effect, with examples involving image processing - instead of event handling which is where Erlang can be said to be more efficient.

Finally, a hardware-level example - which kind of hardware is good at simultaneous working, and which is good at simultaneous waiting?

If your goal is parallelizing work, eventually you'll deteriorate to SIMD. SIMD is great because there's just one "manager" - instruction sequencer - for many "workers" - ALUs. CPUs, DSPs and GPUs all have SIMD. NVIDIA calls its ALUs "cores" and 16-32 ALUs running the same instruction "threads", but that's just shameless marketing. A "thread" implies, to everyone but a marketeer, independent control flow, while GPU "threads" march in lockstep.

In practice, SIMD is hard despite thousands of man-years having been invested into better languages and libraries - because telling a bunch of dumb soldiers marching in lockstep what to do is just harder than running a bunch of self-motivating threads each doing its own thing.

(Harder in one way, easier in another: marching in lockstep precludes races - non-deterministic, once-in-a-blue-moon, scary races. But races of the kind arising between worker threads can be almost completely remedied with tools. Managing the dumb ALUs can not be made easier with tools and libraries to the same extent - not even close. Where I work, roughly there's an entire team responsible for SIMD programming, while threading is mostly automatic and bugs are weeded out by automated testing.)

If, however, you expect to be waiting much of the time - for memory or for high-latency floating point operations, for instance - then hoards of hardware threads lacking their own ALUs, as in barrel threading or hyper-threading, can be a great idea, while SIMD might do nothing for you. Similarly, a bunch of weaker cores can be better than a smaller number of stronger cores. The point being, what you really need here is a cheap way to keep context and switch between contexts, while actually doing a lot at once is unlikely to be possible in the first place.

Conclusions

  • Doing little or nothing while waiting for many things is both surprisingly useful and surprisingly hard (which took me way too long to internalize both in my hardware-related and software/server-related work). It motivates things looking rather strange, such as "green threads" and hardware threads without their own ALUs.
  • Actually doing many things in parallel - to me the more "obviously useful" thing - is difficult in an entirely different way. It tends to drag in ugly languages, intrinsics, libraries etc. about as much as having to do one single thing quickly. The "parallelism" part is actually the simplest (few threads so easy context management; races either non-existent [SIMD] or very easy to weed out [worker pool running tasks])
  • People doing servers (which wait a lot) and people doing number-crunching (work) think very differently about these things. Transplanting experience/advice from one area to the other can lead to nonsensical conclusions.

See also

Parallelism and concurrency need different tools - expands on the reasons for races being easy to find in computational code - but impossible to even uniformly define for most event handling code.

Can your static type system handle linear algebra?

It seems a good idea to use a static type system to enforce unit correctness - to make sure that we don't add meters to inches or some such. After all, spacecraft has been lost due to such bugs costing hundreds of millions.

Almost any static type system lets us check units in sufficiently simple cases. I'm going to talk about one specific case which seems to me very hard and I'm not sure there's a type system that handles it.

Suppose we have a bunch of points - real-world measurements in meters - that should all lie on a straight line, modulo measurement noise. Our goal is to find that line - A and B in the equation:

A*x + B = y

If we had exactly 2 points, A and B would be unique and we could obtain them by solving the equation system:

A*x1 + B = y1
A*x2 + B = y2

2 equations, 2 unknowns - we're all set. With more than 2 measurements however, we have an overdetermined equation system - that is, the 3rd equation given by the measurement x3, y3 may "disagree" with the first two. So our A and B must be a compromise between all the equations.

The "disagreement" of x,y with A and B is |A*x + B - y|, since if A and B fit a given x,y pair perfectly, A*x+B is exactly y. Minimizing the sum of these distances from each y sounds like what we want.

Unfortunately, we can't get what we want because minimizing functions with absolute values in them cannot be done analytically. The analytical minimum is where the derivative is 0 but you can't take the derivative of abs.

So instead we'll minimize the sum of (A*x + B - y)^2 - not exactly what we want, but at least errors are non-negative (good), they grow when the distance grows (good), and this is what everybody else is doing in these cases (excellent). If the cheating makes us uneasy, we can try RANSAC later, or maybe sticking our head in the sand or something.

What's the minimum of our error function? Math says, let's first write our equations in matrix form:

(x1 1)*(A B) = y1
(x2 1)*(A B) = y2
...
(xN 1)*(A B) = yN

Let's call the matrix of all (xi 1) "X" and let's call the vector of all yi "Y". Then solving the equation system X'*X*(A B) = X'*Y gives us A, B minimizing our error function. Matlab's "\" operator - as in X\Y - even does this automatically for overdetermined systems; that's how you "solve" them.

And now to static type systems. We start out with the following types:

  • x,y are measured in meters (m).
  • A is a number (n)- a scale factor - while B is in meters, so that A*x + B is in meters as y should be.

What about X'*X and X'*Y? X'*X looks like this:

sum(xi*xi) sum(xi) //m^2 m
sum(xi)    sum(1)  //m   n

….while X'*Y looks like this:

sum(xi*yi) sum(yi) //m^2 m

The good news is that when we solve for A and B using Cramer's rule which involves computing determinants, things work out wonderfully - A indeed comes out in meters and B comes out as a plain number. So if we code this manually - compute X'*X and X'*Y by summing the terms for all our x,y points and then hand-code the determinants computation - that could rather easily be made to work in many static type systems.

The bad news is that thinking about the type of Matlab's \ operator or some other generic least-squares fitting function makes my head bleed.

Imagine that large equation systems will not be solved by Cramer's rule but by iterative methods and all the intermediate results will need to have a type.

Imagine that we may be looking, not for a line, but a parabola. Then our X'*X matrix would have sum(x^4), sum(x^3) etc. in it - and while spelling out Cramer's rule works nicely again, a generic function's input and output types would have to account for this possibility.

Perhaps the sanest approach is we type every column of X and Y separately, and every element of X'*X and X'*Y separately - not meters^something but T1, T2, T3… And then we see what comes out, and whether intermediate results have sensible types. But I just can't imagine the monstrous types of products of the various elements ever eventually cancelling out as nicely as they apparently should - not in a real flesh-and-blood programming language.

Maybe I'm missing something here? I will admit that I sort of gave up on elegance in programming in general and static typing in particular rather long ago. But we have many static type systems claiming to be very powerful and concise these days. Maybe if I did some catching-up I would regain the optimism of years long gone.

Perhaps in some of these systems, there's an elegant solution to the problem of \'s type - more elegant than just casting everything to "number" when it goes into \ and casting the result back into whatever types we declared for the result, forgoing the automatic type checking.

So - can your static type system do linear algebra while checking that measurement units are used consistently?

C++11 FQA anyone?

isocpp.org announced "the C++ FAQ" to which reportedly Bjarne Stroustrup and Marshall Cline will link to help boost its search rank. Good stuff; also, given that the language is >30 years old - it's about time.

Which reminds me. I'm failing to keep my promise to update the C++ FQA, an overly combative reply to Marshall Cline's fairly combative FAQ, to the new standard from hell - C++11.

Could you perhaps do it instead of me? Unfortunately, explaining how terrible C++ is cannot be made into a career the way explaining how wonderful C++ is can be. So I sort of moved on to other things; I managed to summon the mental energy just that one time to write the C++ FQA, and even as I wrote it I knew that I better hurry because that energy was waning. After all, at work I managed to reduce the amount of C++ I deal with to a minimum, so there was neither a profit motive nor continuous frustration to recharge me.

Not that I haven't made money off the C++ FQA - indirectly I did make money, specifically by getting a headhunter's attention and significantly improving my employment conditions. But I don't think a new FQA's maintainer/co-author could count on something along those lines.

Still, you'd become about as widely famous in narrow circles as I am, getting maybe 200K unique visitors per year and hundreds of "thank you" emails. Most importantly, you'd do the world a great service.

There's a curious double standard in the world of programming languages. Say, PHP is widely ridiculed because, for instance, its string comparison operator converts strings to numbers if they start with "0e" and this results in unexpected behavior. And it doesn't help PHP that === works just fine.

Imagine someone mocking C++ for two char pointers comparing "wrong" with ==. Immediately they'd be told that casting to std::string (more verbose than ===) would work just fine, and that "you just don't get it".

Why is PHP - a language full of quirks which at least gives you memory and type safety - is universally ridiculed and the developers, while defending the language, never ridicule back, while C++, an absolutely insane language, is ridiculed often enough but C++ developers always counter-attack viciously? For that matter, why isn't Lisp ridiculed for EQ, EQL, EQUAL and EQUALP, if comparison operators are so funny?

The reason, IMO, is simple: PHP is not taught in academic CS courses. C++ developers are much more likely to have a CS degree, therefore both they and others treat their knowledge of crazy C++ arcana as something of intellectual value. And Lisp is the poster child of academic programming language development. Educated proponents deter attempts at ridiculing a language.

In a "rational" universe, PHP would be held in high esteem given the amount of output produced by PHP developers with little training. In our universe we have this double- or triple-standard. Pointing out the darker aspects of C++ - which are most of its aspects - thus increases the supply of a rather scarce commodity.

But why C++ and not, say, Lisp, Haskell or C#?

One reason is that C++ is arguably the craziest language in widespread use, combining the safety of C (and I can live with C just fine) with the clarity of Perl (and I can live with Perl peacefully enough). But it is of course subjective.

The thing that really sets C++ apart is its development culture and value system - a perverse amalgamation of down-to-earth shrewdness and idealistic perfectionism. The whole idea of making the best, richest, most efficient and most generic/versatile programming language of the planet - you can sense that Bjarne Stroustrup aims at nothing less - ON TOP OF C is the perfect illustration of this culture, and perhaps its origin.

Stroustrup always knew that the language would be better if it didn't have to be compatible with C and he publicly acknowledged it - the "smaller, much simpler language struggling to come out" remark. But he also knew that using an Embrace, Extend and Exterminate strategy is a much more likely way to succeed. So he did that. In fact he managed to more or less kill C or at least put it in a coma - as even C99, not to mention C11, will never be supported by the Microsoft compiler, and C89 is a really old and really restrictive standard.

A shrewd move - almost an evilly shrewd one - netting the people behind C++ fame and fortune at the expense of programmers dealing with a uniquely dangerous language full of sugar-coated death traps.

(Yes, sugar-coated death traps, you clueless cheerleaders. X& obj=a.b().c() - oops, b() is a temporary object and c() returns a reference into it! Shouldn't have assigned that to a reference. Not many chances for a compiler warning, either.)

Who did things differently? Sun and Microsoft, for example, marketing Java and C#, respectively. They made much cleaner languages from scratch, and to solve the chicken-and-egg problem - we have no legacy projects hence no programmers hence no new projects hence no programmers - they used money, large marketing budgets and large budgets for creating large standard libraries. A much more honest approach yielding much better results, I find.

And Stroustrup says about himself that he "lacks marketing clout" and says Java and C# are bad for you because they're "platforms, not languages", whatever that means.

And I'm not claiming to be able to read minds, but if I had to bet - I'd say he really believes that. He probably thinks he's your altruistic benefactor and Java and C# are evil attempts to drag you into proprietary platforms.

And you can see the same shrewdness, the same "altruism", the same attention to detail, the same tunnel vision - "this shit I'm working on is so important, it deserves all of my mental energy AND the mental energy of my users at the expense of caring about anything else" - throughout the C++ culture. From the boost libraries to "Modern C++ Design" (the author has since repented and moved to D - or did he repent?..) to the justifications for duplicate and triplicate and still incomplete language features to your local C++ expert carrying his crazy libraries and syntactic wrappers and tangling your entire code base in his net.

And this approach to life - "altruism" plus perfectionism plus cleverness plus shrewdness - extends way beyond C++ and way beyond programming. And the alternative is taming your ambitions.

This was my larger purpose in writing about all this shit. I'm pretty sure I failed in the sense that it got drowned in C++ error messages and other shits and giggles.

So maybe you'll do better than me. If you do, you might contribute to the sanity of many a young idealistic programmer - these tend to get sucked into C++'s sphere of influence, either emerging old and embittered years later or lost to sanity forever, stuck in an internally consistent but absolutely crazy way of thinking.

BTW I never wanted it to be a personal attack, in the sense that (1) I don't know what's inside the heads of people who promote C++ and (2) I can tell you with certainty that if I made something 10 times as bad as C++ and it was 0.0001x as popular as C++, I'd be immensely proud of myself and I wouldn't give a damn about what anyone thought. Just like I don't give a damn about people with so much time on their hands to actually have thousands of karma points at StackOverflow explaining just how lame C++ FQA is.

In this sense, C++ is fine and a worthy achievement of a lifetime. Especially in a world where Putin is a candidate for a Nobel Peace prize and Obama already got one.

I basically just think that (1) C++'s horrible quirks are worth pointing out and (2) there's something to be learned from this story about the way we pave the road to hell with our good intentions. That's all.

***

A lot of people have spoken about "a C++ renaissance" when C++11 was ratified. I tend to agree - indeed the new standard is a fresh doze of the same thing that C++ always was: "zero-overhead" pretty-looking syntax with semantics quite horrendous once you think what it actually means.

Fittingly for a new revision of the C++ standard, more things we could once safely count on have gone with the wind. For instance, anything you could pass to a function used to be an expression, and expressions had one and only one type. How ironic that this revision, the revision that finally capitalized on this fact by introducing auto, also made this fact no longer true: {0} could be an int or an std::initializer_list, depending on the context. Context-dependent types were one thing that Perl had (scalar/vector context) but C++ didn't have.

(I have observed someone do this: _myarr[5]={0}; - they had in the .h file the definition int _myarr[5] and they remembered that this thing could be initialized with {0} in other contexts. What they did wouldn't compile in C++98; in C++11 it promptly assigned the int 0 to the non-existent 5th element of _myarr, and the usual hilarity ensued. Imagine how PHP would be ridiculed for this kind of little behavior - and PHP at least would never overwrite an unrelated variable with garbage. Imagine how with C++, the poor programmer will be ridiculed instead.)

This is nothing, BTW - it's not the kind of thing the FQA would normally poke fun at. We have bigger fish to fry. For instance, C++11's "closures" or "lambdas" - the preposterous thing where you say [](){…} and an anonymous struct gets generated. BTW it's one of the reasons I urged everyone where I work to upgrade to C++11; because it lets you implement a nicely-looking parallel_for. In C I would have added a compiler extension for it AGES AGO but extending the C++ grammar? No sir. I waited patiently until C++11 lambdas.

So, C++11 lambdas. Someone needs to write everything about those hideous capture lists - are references captured by value or by reference?! - and how you can end up passing dangling references if this closure thingie outlives variables it references (we don't need no stinking garbage collection! but why doesn't C++11 let one capture variables by std::shared_ptr?.. It's so much better than gc!), and about the type of this shit and how it looks in debuggers, and about std::function etc. etc.

And this won't be me because frankly, I don't have time to even fully master this arcana anymore.

If you dislike C++ and have the time to write about it, I'll gladly pass the torch on to you; specifically I'll redirect C++ FQA to point to your site, following the example of Bjarne Stroustrup and Marshall Cline. Or we could run a wiki, or something. Email or comment if you're interested.

Very funny, gdb. Ve-ery funny.

Have you ever opened a core dump with gdb, tried to print a C++ std::vector element, and got the following?

(gdb) p v[0]
You can't do that without a process to debug.

So after seeing this for years, my thoughts traveled along the path of, we could make a process out of the core dump.

No really, there used to be Unices with a program called undump that did just that. All you need to do is take the (say) ELF core dump file and generate an ELF executable file which loads the memory image saved in the core (that's actually the easier, portable part) and initializes registers to the right values (the harder, less portable part). I even wrote a limited version of undump for PowerPC once.

So we expended some effort on it at work.

And then I thought I'd just check a live C++ process (which I normally don't do, for various reasons). Let's print a vector element:

(gdb) p v[0]
Could not find operator[].
(gdb) p v.at(0)
Cannot evaluate function — may be inlined.

Very funny, gdb. "You can't do that without a process to debug". Well, I guess you never did say that I could do that with a process to debug, now did you. Because, sure enough, I can't. Rolling on the floor, laughing. Ahem.

I suggest that we all ditch our evil C arrays and switch to slow-compiling, still-not-boundary-checked, still-not-working-in-debuggers-after-all-these-YEARS std::vector, std::array and any of the other zillion "improvements".

And gdb has these pretty printers which, if installed correctly (not easy with several gcc/STL versions around), can display std::vector - as in all of its 10000 elements, if that's how many elements it has. But they still don't let you print vec[0].member.vec2[5].member2. Sheesh!

P.S. undump could be useful for other things, say a nice sort of obfuscating scripting language compiler - Perl used to use undump for that AFAIK. And undump would in fact let you call functions in core dumps - if said functions could be, um, found by gdb. Still, ouch.

P.P.S. What gdb prints and when depends on things I do not comprehend. I failed to reproduce the reported behavior in full at home. I've seen it for years at work though.

Delayed printf for real-time logging

The article is at embeddedrelated.com. The nice bit is, you save format string pointers to files, and you read the strings later directly from the executable, and then do the formatting. There are a few tricks related to scripting gdb in Python and other such things. The upshot is that you get to do free-form text logging from virtually any context - including things like interrupt handlers, etc.

Coroutines in one page of C

I've written a single page implementation of coroutines in C using setjmp/longjmp and just a little bit of inline assembly. The article is at embeddedrelated.com. Here's an example C program using coroutines - equivalent to the Python code:

def iterate(max_x, max_y):
  for x in range(max_x):
    for y in range(max_y):
      yield x,y

for x,y in iterate(2,2):
  print x,y

In C:

#include <stdio.h>
#include "coroutine.h"

typedef struct {
  coroutine* c;
  int max_x, max_y;
  int x, y;
} iter;

void iterate(void* p) {
  iter* it = (iter*)p;
  int x,y;
  for(x=0; x<it->max_x; x++) {
    for(y=0; y<it->max_y; y++) {
      it->x = x;
      it->y = y;
      yield(it->c);
    }
  }
}

#define N 1024

int main() {
  coroutine c;
  int stack[N];
  iter it = {&c, 3, 2};
  start(&c, &iterate, &it, stack+N);
  while(next(&c)) {
    printf("x=%d y=%d\n", it.x, it.y);
  }
}

Yeah it's a bit longer, or perhaps more than a bit longer. But it's still very useful at times. Check it out!

Parallelism and concurrency need different tools

concurrent (noun): Archaic. a rival or competitor.

dictionary.com

Two lines that do not intersect are called parallel lines.

Wikipedia

In this piece, I disagree with Joe Armstrong and Rob Pike, basing my argument on the differences between vending machines and gift boxes (illustrated with drawings carefully prepared in Microsoft Paint).

Parallelism and concurrency are both very fashionable notions. Lots of languages and tools are advertised as good at these things - often at both things.

I believe that concurrency and parallelism call for very different tools, and each tool can be really good at either one or the other. To oversimplify:

It's possible for both kinds of tools to co-exist in a single language or system. For example, Haskell appears to have good tools for both concurrency and parallelism. But it's still two different sets of tools, and the Haskell wiki explains that you shouldn't use the concurrency tools if you're after parallelism:

Rule of thumb: use Pure Parallelism if you can, Concurrency otherwise.

The people behind Haskell realize that one tool for both isn't good enough. It's a similar story with the new language ParaSail - it offers both kinds of tools, and advises to avoid its concurrency features in parallel, non-concurrent programs.

This is in stark contrast to the opinion of some people primarily interested in concurrency, who think that their tools are a great fit for parallelism. Rob Pike says Go makes parallelism easy because it's a concurrent language:

Concurrency makes parallelism (and scaling and everything else) easy.

Likewise, Joe Armstrong claims that Erlang is great for parallelism because it's a concurrent language. He goes as far as to say that seeking to parallelize "legacy" code written in non-concurrent languages is "solving the wrong problem".

What causes this difference in opinions? Why do Haskell and ParaSail people think you need separate tools for concurrency and parallelism, while Go and Erlang people think that concurrent languages handle parallelism just fine?

I think people reach these different conclusions because they work on different problems, and develop a different perception of the basic distinction between concurrency and parallelism. Joe Armstrong has drawn a picture to explain how he sees that distinction. I'll draw you a different picture - but first, here's a reproduction of his:

Concurrency-centric view of concurrency vs parallelism

Actually many aspects of concurrency are present with just one queue, but I reproduced the 2 queues from Armstrong's original drawing. So what does this picture say?

  • "Parallel" means that you serve Coke faster.
  • "Parallel" doesn't mean much else - either way, it's the same concurrent problem.
  • Who gets Coke first depends on who gets in line first.
  • In a way it doesn't matter who gets Coke first - they all get a can of Coke one way or another.
  • …except perhaps for the few last ones, if the machine runs out of Coke - but that's life, man. Somebody has to be the last.

Indeed, that's all there is to it with a vending machine. What about giving gifts to a bunch of kids? Is there a difference between coke cans and presents?

In fact there is. The vending machine is an event handling system: people come at unpredictable moments, and find an unpredictable amount of cans in the machine. The gift boxes is a computational system: you know the kids, you know what you bought, and you decide who gets what and how.

In the gift boxes scenario, "concurrent vs parallel" looks very different:

Parallelism-centric view of concurrency vs parallelism

How is "concurrent" different from "parallel" in this example?

  • Concurrent present-dealing is a lot like a vending machine: who gets what depends on who got there first.
  • Parallel present-dealing is not like that: each present is labeled, so you know who gets what.
  • In the concurrent case, a queue is necessary - that's how you avoid two kids fighting over a present/two tasks corrupting a shared data structure.
  • In the parallel case, a queue is not necessary. No matter who gets where first, they all end up with the right present.

Queueing in front of a gift heap is concurrent in that archaic sense of "rivalry, competition": you want to be there first, so that it's you who gets the Lego set. In Russian, "concurrent" (pronounced kon-koo-rent) is the contemporary word for "competitor".

With labeled gifts, there's no competition. Logically, labels "connect" each kid with his gift, and these are parallel, non-intersecting, non-conflicting lines. (Why did I draw the arrows so that they do intersect? We'll get back to it later; a good way to think of it is, kids'/processors' paths do cross as they search for their gifts/data - but those intersections are not conflicts over who gets what.)

Computation vs event handling

With event handling systems such as vending machines, telephony, web servers and banks, concurrency is inherent to the problem - you must resolve inevitable conflicts between unpredictable requests. Parallelism is a part of the solution - it speeds things up, but the root of the problem is concurrency.

With computational systems such as gift boxes, graphics, computer vision and scientific computing, concurrency is not a part of the problem - you compute an output from inputs known in advance, without any external events. Parallelism is where the problems start - it speeds things up, but it can introduce bugs.

Let's proceed to discuss the differences between computational systems and event handling systems. We'll start with obvious things, like determinism, but there are many more subtle consequences.

Determinism: desirable vs impossible

In computational systems, determinism is desirable because it makes life easier in many ways. For example, it's nice to test optimizations and refactorings by observing that results didn't change - which you can only do with deterministic programs.

Determinism is often not strictly required - you may genuinely not care which 100 images you find as long as they all have kittens in them, or exactly what approximation of PI you compute as long as it's somewhere between 3 and 4. Determinism is merely very nice - and possible.

In event handling systems, however, determinism is impossible, in the sense that it is a requirement for different orders of events to produce different results. If I got there first, I should get the last Coke can, the last dollar in the shared bank account, etc. If you beat me to it by a millisecond, then it should go to you.

Of course, for basic debugging, you can always run the system completely serially, handling one event a time. But once you make it a bit more realistic and start handling events without first finishing everything that you were doing, you can no longer expect a specific result.

Even in a debugging environment, if you replay an event log with two users racing to withdraw money from a shared bank account, you'll run the thing twice and it might reach two different final states. The number of possible final states is exponential in the number of conflicts.

Ouch.

Sign of parallel safety: determinism vs correctness

How do you know that you have no bugs due to the different ordering of events?

In computational systems, if you get the same result every time, you probably have no parallelism bugs - even if the result is busted. Dog pictures instead of cats means you have bugs - but not parallelism bugs if it's the same dogs every time.

In event handling systems, the only sure sign of not having parallelism bugs is if you always get correct results.

With two users racing to withdraw money, you can't expect to always reach the same result. What can you expect, assuming the bank isn't buggy? Many things. You (presumably) can never have a negative account balance, you (presumably) can't drain a bank account twice, in effect creating money, etc.

If none of these wrong things ever happen, then you got yourself a functioning bank, otherwise you have a bug. With a telephony system, it's similar - except that there's a different set of requirements defining "correct results".

There's no general way to tell if there are timing-related bugs without understanding the aspects of the system having nothing to do with timing.

Ouch!

Parallelism bugs: easy to pinpoint vs impossible to define

With labeled gift boxes, parallelism bugs are trivial to spot - even if the labels are in Chinese, and you can't read Chinese:

Even if you can't read the labels, knowing that they exist is enough. If two kids fight over a box, then something is wrong and you call an adult who can read the labels.

If you're an automated debugging tool, and you don't know which task is supposed to process which data, it's enough to know that a task shouldn't access data modified by another task. If that happens, you tell the programmer, so that he properly assigns data to tasks without conflicts.

Such tools aren't hypothetical - they're out there. Cilk comes with a tool that does that. Checkedthreads has a Valgrind-based tool that does that.

Parallel Haskell doesn't have to do it - because you have no side effects to begin with, guaranteeing zero conflicts when things are evaluated in parallel. But even with dynamic checking instead of static guarantees, your parallelism bugs are basically just gone.

The upshot is that in computational systems, you don't need to know much to flag bugs and to point to the exact problem they cause. "Here's the box they were fighting over".

In event handling systems, a responsible adult must be much more knowledgeable to maintain discipline.

To be sure, there still is a weaker, but universal rule which always applies: you can't touch anything if someone is already busy with it. You must wait your turn in a queue. An adult can make sure this rule is followed, which is better than nothing. But it's not enough to prevent many common kinds of mischief:

Someone approaching the presents without queuing is clearly wrong - bug detected, order restored.

But someone unpacking a present, leaving the queue, and coming back to find that the present was taken by someone else could be wrong - or it could be OK. After all, that other kid waited his turn, so the only universal rule of event handling systems was followed - but not necessarily the specific rules of how we give out presents around here.

Here's how these problems look in code. The following buggy money transfer implementation can be flagged as buggy by an automated debugging tool:

src.balance -= amount
dst.balance += amount

Here we have no synchronization at all - src.balance can be modified by two processes without one of them waiting for the other to be done with it. So you could lose some of the decrements or what-not. A data race detector like Helgrind will spot this bug by monitoring memory accesses and observing the lack of synchronization - just as easily as Cilk's or checkedthreads' checker.

However, the version below is probably still buggy, but it would look fine to data race detectors:

atomic { src.balance -= amount }
atomic { dst.balance += amount }

Here, "atomic" means that everyone has to wait before modifying the balances in some sort of queue - possibly a very fancy sort of queue, but a queue nonetheless; more on that later. Orderly queuing makes data race detectors happy - but is there still a bug?

A thread could be suspended after decrementing src.balance but before incrementing dst.balance, exposing an intermediate state where money is "temporarily lost". Is this a problem? Perhaps. Do you understand banking? I don't, not really - and Helgrind surely doesn't.

Here's a more obviously wrong program:

if src.balance - amount > 0:
  atomic { src.balance -= amount }
  atomic { dst.balance += amount }

Here, a thread could test that src.balance has enough money to withdraw a given amount, and go to pee. Another thread arrives, makes a similar test and gets busy withdrawing the money. The first thread comes back, is put into some queue, waits his turn and withdraws its own amount - which could have become illegitimate, the first thread having withdrawn too much already.

This is a lot like coming back and realizing that another kid has walked away with your Apple iPhone®. It's a race condition which isn't a data race - that is, it happens despite everyone's civil and polite queuing.

What's a race condition? It depends on the application. I'm sure the last snippet is buggy, but I'm less sure about the previous snippet, and it all depends on how the bank works. And if we can't even define "race conditions" without knowing what the program is trying to do, we can't hope to automatically detect them.

Of course you can also not have race conditions. All you have to do is implement the entire withdrawal atomically, and of course all approaches to concurrency let you do it.

The trouble with race conditions though, as opposed to, say, null pointer exceptions, is that you can feed the program bug-triggering inputs again and again, and the thing will only reproduce once in a blue moon. Here's where deterministic, automated debugging could be really handy, flagging the bug every time you feed those inputs. Unfortunately, it's impossible with event handling systems.

"Ouch" again.

Queues: implementation detail vs part of the interface

With labeled gifts, you don't need a queue. Everyone can just go and take their present.

Or maybe not, not really. Thousands of kids getting thousands of labeled presents will need a queue or several, or else they'd be running into each other. How those queues work doesn't matter in the sense that everyone ends up with the same present anyway. The choice of a queuing method does affect how much time you wait for your present though.

That's why I drew the "logically parallel" lines connecting everyone to their labeled gifts so that they intersect - even though these aren't "actual conflicts" over who gets what. (I'm very careful with my metaphors and my Microsoft Paint art - very careful.)

4 processors accessing unrelated data items through a single memory bus is when "logically parallel lines technically cross", and in effect processors will have to wait in some hardware-level queue to make it work. 1000 logically independent tasks distributed to these 4 processors through a load balancing scheduler will again need queues to work.

There are almost always many queues even in a fully parallel, conflict-free system - but they're an implementation detail and they shouldn't affect results. You get the same thing no matter what your place was in any of the queues:

Conversely, in a concurrent system, the queue is right there in the interface:

  • A semaphore has a queue of threads waiting to lock it, and who gets there first will affect results.
  • An Erlang process has a queue of messages, and who sends a message first will affect results.
  • A goroutine listens to a channel, and the order of writes to a channel affects results.
  • With transactional memory, everyone failing to commit a transaction is queuing.
  • With lock-free containers, everyone failing to update the container is queuing.

With event systems, you can have simple queues or fancy queues, but they are all explicit queues in the sense that order often affects results, and that's OK - or it could be a race condition, go figure.

With computational, parallel, conflict-free systems, order shouldn't affect results - and you want tools to make sure the system is indeed conflict-free.

Rob Pike shows in his slides how easy it is to build a load balancer in Go. Sure it's easy. Go is for concurrency and concurrency is all about queues, and so is load balancing. Not that load balancers are very hard in any language, but yeah it's particularly easy in concurrent languages.

But that's one part of the parallelism story; the next thing you want is either static guarantees of not having conflicts, or dynamic checking. I'm not saying it can't be done in Go - sure it can be done - just that without having it, you underserve computational systems. And once you do have it, it's a different set of interfaces and tools from channels and goroutines - even if underneath it's implemented with channels and goroutines.

Importance of preemption

So conflict prevention/detection is something computational systems need that concurrency tools don't provide. Are there things that concurrency tools do provide that computational systems don't need?

Sure - explicit queues, for starters. Not only aren't they needed, but explicit queues get in the way of computational systems, because as we've seen, queues lead to race conditions that aren't data races and you can't pinpoint those.

Another thing computational systems don't really need is very cheap preemptable threads/processes.

With event handling systems, you sometimes want to react to many different events very quickly, which takes many preemptable processes. You want to have 10000 processes stuck in the middle of something, and process number 10001 can still get activated immediately when an event arrives. This can't work unless preemptable processes are very cheap.

With computational systems, it's nice to have cheap tasks that you can map to the relatively expensive OS threads - but tasks are not as powerful as processes. You don't activate tasks upon events. What tasks do is they wait in various queues, and get processed when a worker thread becomes idle to run them. Unlike the case with goroutines or the like, you can't have more tasks in flight than you have OS threads - and you don't need to.

Tasks can be done nicely with fairly traditional runtimes - as opposed to full-blown cheap processes/goroutines/whatever, which seem to me to need more work at the low-level runtime system side of things.

This shows how platforms aiming at concurrency not only under-serve computational systems, but over-serve them as well.

(To be fair, it is theoretically conceivable to gain something from preemption in a computational system - namely, it could improve throughput if it lets freshly created tasks which are a part of the critical path preempt in-flight tasks which are not. However, in my long and sad experience, a scheduler actually knowing what the critical paths are is little more than a theoretical possibility. And a dumb greedy scheduler has no use for preemption.)

Differences among tools from the same family

Tools aimed at concurrent event systems are not all alike, nor are tools aimed at parallel computational systems. They're from the same family, but there are substantial differences.

  • Erlang will not let processes share memory at all. This means you never have data races, which doesn't particularly impress me because as we've seen, data races are easy to find automatically - and race conditions are not at all eliminated by not sharing memory. The nice thing though is you can seamlessly scale to multiple boxes, not just multiple cores in the same chip.
  • Rust won't let you share memory unless it's immutable. No easy multi-box scaling, but better performance on a single box, and no need for data race detectors, which can have false negatives because of poor test coverage. (Actually it's not quite like that - here's a correction, also explaining that there are plans to add parallelism tools to Rust in addition to its existing concurrency facilities.)
  • Go will let you share everything, which gives the most performance at the price of what I think is a tolerable verification burden. For data races, Go has a data race detector, and race conditions will just happen in event systems anyway.
  • STM Haskell will let you share immutable data freely, and mutable data if you explicitly ask for it. It also provides a transactional memory interface, which I think is a cool thing to have and sometimes quite a pain to emulate with other methods. Haskell has other concurrency tools as well - there are channels, and if you want to have Erlang's scalability to multiple machines, apparently Cloud Haskell does the trick.

Of course the other big difference is that you'd be programming in Erlang, Rust, Go and Haskell, respectively. And now to computational systems:

  • Parallel Haskell will only parallelize pure code. A static guarantee of no parallelism bugs at the cost of having no side effects.
  • ParaSail allows side effects, but disallows many other things, such as pointers, and as a result it will only evaluate things in parallel if they don't share mutable data (for example, you can process two array slices in parallel if the compiler is convinced that they don't overlap). Similarly to Haskell, ParaSail has some concurrency support - namely "concurrent objects" which indeed can be shared and mutable - and documentation stresses the benefits of not using concurrency tools when all you need is parallelism.
  • Flow appears to be based on a pure functional core, which is restricted even further to let the compiler fully comprehend the data flow in the program, and allowing it to target platforms as diverse as Hadoop and CUDA. Things syntactically looking like side effects, parallel reduction, etc. are supposed to be a layer of sugar atop this core. At least that's my impression from reading the Manifesto, which admittedly I didn't fully understand ("if a morphism is surjective and injective, then it is a bijection and therefore it is invertible" is more obvious to some of us than others.)
  • Cilk is C with keywords for parallel loops and function calls. It will let you share mutable data and shoot yourself in the foot, but it comes with tools that deterministically find those bugs, if they could ever happen on your test inputs. What makes uninhibited shared mutable data very useful is when you don't shoot yourself in the feet - when a parallel loop computes stuff with whatever task-local side-effect-based optimizations, and then the loop ends and now everyone can use the stuff. Like children unpacking their Lego sets, each building stuff from them and then all playing together - no side effects is sometimes a lesser Lego. (The Proper Fixation blog - over-extending metaphors since 2008.)
  • checkedthreads is a lot like Cilk; it doesn't rely on language extensions, and all of it is free and open - not just the interface and the runtime but the bug-finding tools as well.

I wrote checkedthreads, so this is the advertisement part; checkedthreads is portable, free, safe and available today in the very mainstream languages C and C++, unlike many systems requiring new languages or language extensions.

With Cilk, there's an effort to standardize it in C++1y or some such, but Cilk wants to add keywords and the C++ people don't like adding keywords. Cilk is available in branches of gcc and LLVM, but it won't run on all platforms at the moment (it extends the ABI) and it hasn't been merged into the mainline for a while. Some newer Cilk features are patented. Not all of it is freely available, etc. etc.

Cilk's big advantage, however, is that it's backed up by Intel, whereas checkedthreads is backed up by yours truly. And if Cilk suits you and you decide to use it as a result of my checkedthreads-related blogging, I will have achieved my goal of automated debugging of parallel programs getting some much-deserved attention.

The upshot is that not all concurrency tools are alike, and different parallelism tools are likewise different - I don't even claim to have pointed out the most important differences between my examples; it's hairy stuff. Still, they're two different families, and the first thing to do is to pick the right family.

Conclusion

We've discussed differences between parallel, computational systems and concurrent, event handling systems. Areas of differences include:

  • Determinism: desirable vs impossible
  • Sign of parallel safety: same results vs correct results
  • Parallelism bugs: easy to pinpoint vs impossible to define
  • Queues: implementation detail vs part of the interface
  • Preemption: nearly useless vs nearly essential

For event handling systems, concurrency is their essence and parallelism is a part of some solutions - typically good ones (two vending machines is better than one). For computational systems, parallelism is their essence and concurrency is a part of some solutions - typically bad ones (a heap of gifts is usually worse than labeled gifts).

I hope to have gained more clarity than confusion by occasionally conflating "parallelism/concurrency" with "computation/event handling". I also hope I didn't paint too dark a picture of event handling systems - perhaps there are automated verification strategies that I'm not aware of. However it came out, I don't claim that my viewpoint and terminology usage is "the" way of looking at this.

There's value in the angle preferred by some people interested in event handling systems - "concurrency is dealing with several things at a time, parallelism is doing several things at a time". From this angle, parallelism is an implementation detail, while concurrency is in the structure of the program.

I believe there's value in my perspective as well - namely, "concurrency is dealing with inevitable timing-related conflicts, parallelism is avoiding unnecessary conflicts" - "vending machines vs labeled gifts". Here's how the two look like - now with parallel arrows disentangled, as they logically are:

The most important takeaway is that computational code, as opposed to event handling code, can be made virtually bug-free rather easily, using either automated debugging tools or static guarantees.

Handling parallelism using its own tools is nothing new. Rob Pike's work on Sawzall, a specialized parallel language where code is always free from parallelism bugs, predates his work on the concurrent language Go.

However, tools for concurrency appear "louder" than tools for parallelism at the moment - and they can handle parallelism, albeit relatively badly. Loud and bad often crowds out the less visible better. It'll be a pity if better support for parallelism won't be developed as a side effect of "loud concurrency" - or if such support atrophies where already available.

I'd go as far as replying to Armstrong's "parallelizing serial code is solving the wrong problem" with "using 'bare' concurrency tools for computational code is applying your solution to the wrong problem". The simple fact is that C parallelized with the help of the right tools is faster and safer than Erlang.

So here's to "using the right tool for the job", and not having anyone walk away with your Apple iPhone®.

checkedthreads: bug-free shared memory parallelism

If you aren’t deeply frightened about all the issues raised by concurrency, you aren’t thinking about it hard enough.

John Carmack

This post introduces checkedthreads - a free framework for parallelizing C and C++ code, and for automatically finding every race condition that could potentially manifest on given program inputs. It comes with:

  • a Valgrind-based checker for thorough verification
  • an event-reordering scheduler for fast verification

The code in checkedthreads is fresh and wasn't used in production yet. However, tools using the same approach have been successfully used for years in automotive safety software containing millions of lines of code written by many dozens of developers.

A nice use case for checkedthreads is, you have a complex, serial program that you want to parallelize. With checkedthreads, you'll be able to run your test suite and be sure that you have no parallelism bugs - as sure as you'd be about, say, memory leaks. And if your parallelization does introduce bugs, the Valgrind checker will pinpoint them for you, so that you can quickly fix them.

In what follows, I explain how race detection works in checkedthreads, and then briefly discuss its other features and how to get started with it.

***

Why are threads so error-prone? We're accustomed to see the root of the problem in mutable shared memory. Commonly proposed alternatives to threads avoid mutable shared memory. For example, pure functional code gets rid of the "mutable" part, and process-based parallelism gets rid of the "shared" part".

While processes and pure FP have their virtues, I believe that mutable shared memory is not, by itself, what makes threads bug-prone. You can keep mutable shared memory and eliminate the bugs. Moreover, as we'll see, it's perfectly possible to eliminate shared memory - and keep the bugs.

What, then, is the root of the problem? I believe what we need is to find the right interfaces. Threads and locks are great low-level primitives, but a bad interface to use directly in source code.

In this sense, threads are not unlike goto. Goto is a horrible interface for human programmers. But it's fine as a machine instruction underlying higher-level interfaces ranging from loops and function calls to exceptions and coroutines.

What higher-level interfaces do we have on top of threads? One such interface is fork/join parallelism - parallel loops and function calls. ("Loops and function calls" is notably very similar to the most popular interface on top of goto.) "Fork/join", because starting a parallel loop logically forks a thread per iteration, and those threads are joined back when the loop ends:

Ordering constraints of fork-join code

Examples of fork/join interfaces include TBB's/PPL's parallel_for and parallel_invoke, and OpenMP's #pragma omp parallel for. Checkedthreads provides something similar as well:

ctx_for(_objs.size(), [&](int i) {
  process(_objs[i]); //in parallel for all i's
});

Fork/join parallelism is well-known, and appreciated for its automation of synchronization and load balancing. But how does a fork/join interface help with program correctness compared to raw threads and locks?

We'll see how fork/join helps by looking at two methods to verify parallel code - event reordering and memory access monitoring. Both methods are implemented by checkedthreads and together, they guarantee near-100% freedom from parallelism bugs.

We'll discuss why these methods are so effective with fork/join code - and why they are comparatively ineffective with raw threads.

Since parallel function calls can be implemented using parallel loops, we'll only explicitly mention parallel loops. And we'll only discuss "pure" fork/join programs - programs where forking and joining is the only synchronization mechanism.

That is, code inside a parallel loop can access stuff written by whoever spawned the loop - and code after the loop can access stuff written in the loop. But you can't access shared data using semaphores, atomic counters, lock-free containers, etc. - such accesses will be flagged as bugs. To synchronize threads, you must fork or join.

Lastly, we'll assume that you can't proceed until the loop you spawned completes - and you can't wait for anything but a loop you spawned yourself. (This is an obvious property of code spelled using some kind of a "parallel for", but not of other fork/join interfaces; a recent paper calls this property "strict fork/join parallelism".)

We're thus assuming both "purity" and "strictness"… which sounds more restrictive than it is - but that is a separate topic. And now, with all the assumptions spelled out:

Why fork/join code is verifiable - and why raw threads aren't

First, let's consider event reordering - a cheap and effective verification method.

Suppose thread A writes to address X, and thread B concurrently reads from X. Then B might see A's write or it might not, depending on timing, so it's a bug. A cheap way to find this bug is, don't run the program with the production thread scheduler, where the order of events depends on timing.

Instead, use a debugging scheduler which purposefully and deterministically reorders events. Make it schedule things so that in one run, A's write precedes B's read - and in another run, B's read comes first. Then compare the results of the two runs - if they differ, there's a bug.

How many orders does it take to reorder every two instructions that could ever run in parallel? (Actually, this requirement is too weak to find all the bugs - we'll plug that hole later - but it's usually enough to find most of the bugs.)

In a fork/join program, you need just two orders:

  1. run all loops from 0 to N.
  2. run them all backwards, from N to 0.

Here's an illustration - consider this example pseudo-code with nested parallel loops:

parallel for i=0:2
  foo(i)
  parallel for j=0:i
    bar(i,j)
  baz(i)

The ordering constraints - which code block can run in parallel with which - make up the following DAG:

Ordering constraints of fork-join code

And here are the two schedules - the 0 to N one and the backwards one:

The two event reordering schedules

Pick any two instructions that could possibly run in parallel and you'll see that they are reordered in these two schedules.

What happens if you use raw threads and locks?

Then the ordering constraints don't have to look like a simple fork/join DAG. All we can say is that the ordering constraints form some sort of a partial order over the set of serial code blocks (whose instructions are fully ordered).

With N code blocks, an upper bound on the number of orders you might need is N*2. (For every code segment, schedule it to run as early as possible, and then schedule it to run as late as possible - that's N*2 schedules reordering every two independent instructions.)

But N*2 is a lot of orders - N can be rather large. Can we do better, as we did in the fork/join case - by having less schedules which reorder more things?

Perhaps we can - but finding out the lower bound on the number of orders is an NP-hard problem (its standard name is finding the poset dimension). And heuristics analyzing partial orders in an attempt to produce fewer than N*2 orders take minutes to run with N in the few thousands, in my experience.

The upshot is that reordering every pair of independent instructions is trivial with fork/join code and very hard with raw threads.

Now let's consider an improvement upon simple event reordering - memory access monitoring based on something like program instrumentation or a compiler pass.

Plain event reordering has two main drawbacks:

  • Some bugs are missed. Consider updates to accumulators. Whether we run sum+=a[5] before or after sum+=a[6], sum will reach the same value - whereas in a parallel run, it may not. (Finer-grained reordering - trying every way to interleave instructions that could run in parallel, and not just reordering every pair of independent instructions - would catch the bug. But that's obviously infeasible.)
  • Bugs aren't pinpointed. Reordering gives evidence that you have a bug by demonstrating that results differ under different schedules. But how do you find the bug?

Here's how we can improve upon plain reordering. Let's intercept memory accesses and record the ID of the thread which was the last to write to each and every location - the location's owner. (Logically, each loop index could run in a separate thread - so ideally, we'd map each index to a separate thread ID. In practice, we might want IDs to be small, so we'd use low bits of the index or some such.)

Then if a location is accessed by someone whose ID is different from the owner's ID, we have a bug, and we've pinpointed it - all we need to do is print the current call stack:

checkedthreads: error - thread 56 accessed 0x7FF000340 [0x7FF000340,4], owned by 55
==2919==    at 0×40202C: std::_Function_handler…::_M_invoke (bug.cpp:16)
==2919==    by 0×403293: ct_valgrind_for_loop (valgrind_imp.c:62)
==2919==    by 0×4031C8: ct_valgrind_for (valgrind_imp.c:82)
==2919==    by 0×40283C: ct_for (ct_api.c:177)
==2919==    by 0×401E9D: main (bug.cpp:20)

It's a tad more complicated with nested loops but not much more complicated; the details aren't worth discussing here. The point is, this works regardless of whether there's an effect on results that can be reproduced in a serial run. We pinpoint bugs involving accumulators and shared temporary buffers and what-not, even though results look fine in serial runs.

Note that we still rely on event reordering. That's because we catch the bug if an address was written first and then read - but not if it was read first and then written:

Memory access monitoring

This is OK - if you have a reordering scheduler guaranteeing that in one of your runs, the address is actually written first and then read. With such a scheduler and access monitoring, you'll never miss a bug if it could ever happen with your input data - and you'll always have it pinpointed.

Why doesn't it work nearly as well with raw threads and semaphores?

The first problem is, you need a reordering scheduler, and there are too many schedules to cover, as we discussed above. I've shown in the past an example where Helgrind, a memory access monitoring Valgrind tool for debugging pthread applications, misses a bug - because the bug is masked by the order in which things happen to run.

But there's another problem with raw threads and locks, preventing memory access monitoring from pinpointing some of the bugs when they do reproduce.

Specifically, if two threads access a location, and they use a lock to synchronize their accesses, then you can't flag the access as a bug, right? However, it very well may be a bug. It's certainly not a data race (unsynchronized access to memory) - but it can be a race condition (a bug due to event ordering).

As an example, consider a supposedly deterministic simulation (not an interactive game - though the screenshot below is from a game, Lock 'n' Chase). In the simulation, agents run around a maze, picking up coins. Whoever made it first to pick up a coin has won a race - literally!

Lock and Chase

Suppose each agent is simulated by its own thread. Then it might be that the simulated speed of an agent in the maze depends on the amount of CPU time its thread used compared to others. And then results depend on timing and it's a bug. (A thread per maze region would be a better strategy than a thread per agent.)

This is a race condition - because timing affects results. However, as long as agents lock the coins they pick up, it's not a data race - because access to shared memory is synchronized. And memory access monitoring can only pinpoint data races, not race conditions.

You can simulate a maze, notice that memory where you keep a coin representation is accessed concurrently without locking, and print the offending call stack.

But you can't simulate a maze, notice that with different event ordering, different agents pick up the (properly locked) coin, and "print the offending call stack". There simply is no single offending call stack - only the fact that results end up differing.

John Regehr has a great discussion on the difference between race conditions and data races. Check out his example with a bank account, where all account balances are locked "nicely" - but the bank still doesn't properly transfer money between accounts.

The thing that matters in our context is that with pure fork/join code, all race conditions are data races. In the general case of raw threads, it's no longer true - which gets in the way of pinpointing bugs.

Note that using something like Go's channels or even Erlang's processes doesn't solve the maze race problem - in the sense that you still can't pinpoint bugs. Instead of locking coins, you might have agent processes or goroutines or what-not - and one or several processes keeping coins. Agents would send requests to pick up coins to these processes - and who'd make it first isn't deterministic, and there's no single place in the code "where the bug is".

This is one example of eliminating shared memory and keeping the bugs - as opposed to fork/join code's ability to keep shared memory while (automatically) eliminating the bugs.

(This is not to say that Erlang-style lightweight processes aren't a good idea - they can be great in certain contexts. I do believe that parallelizing computational code is a somewhat different problem from handling concurrent events - and that fork/join parallelism is close to "the right thing" for computational code.)

To summarize the entire discussion of verification:

  • With fork/join code, every bug which could possibly manifest with the given inputs can be deterministically reproduced and pinpointed.
  • Conversely, with raw threads and locks, it's computationally hard to deterministically reproduce bugs. Furthermore, even reproducible bugs can not always be automatically pinpointed.

It is also notable that with raw threads, you get to worry about deadlocks. With pure fork/join code, there simply can never be a deadlock.

(The analysis above mostly isn't rigorous and you're welcome to correct me.)

Some features of checkedthreads

Noteworthy features of checkedthreads are discussed here. A short summary:

  • Guaranteed bug detection as discussed above.
  • Integration with other frameworks: if you already use OpenMP or TBB, you can configure checkedthreads to use their scheduler (and their thread pool) instead of fighting over the machine with the other framework.
  • Dynamic load balancing: work gets done as soon as a thread is available to do it. Load balancing automatically takes into account variability between the different tasks - as well as whatever unexpected load the CPUs might be handling while running your code.
  • A C89 and a C++11 API are available.
  • Free as in "do what you want with it" (FreeBSD license).
  • Easily portable (in theory).
  • Small and simple (at the moment).
  • Custom schedulers are very easy to add (though it's more of a recreational activity than a necessity, or so I hope).

Downloading, building, installing and using checkedthreads

I recommend to read the build instructions, and then, possibly, download precompiled binaries. (The build instructions are useful, in particular, to know what you're getting in the binary archive.) The source code of version 1.0 is archived here; git@github.com:yosefk/checkedthreads.git keeps the latest source code.

Before actually using checkedthreads, I recommend to read the rather short documentation of the API, the environment variables, and runtime verification.

If you run into any issues with checkedthreads, drop me a line at Yossi.Kreinin@gmail.com.

Conclusion

To quote the same piece by John Carmack again:

…if you have a large enough codebase, any class of error that is syntactically legal probably exists there.

In other words, "correct by design" is an almost non-existent property; correctness is either demonstrated automatically, or it is absent. While nobody has to make the simple errors we discussed in our toy examples, analogous errors will necessarily creep into large parallel programs.

Checkedthreads shows that for one significant family of parallel imperative programs - fork/join code - it's possible to automatically, deterministically reproduce and pinpoint every bug.

I hope it to be a convincing example of what I think is a more general truth - namely, that parallel imperative programs don't have to be "deeply frightening", any more than serial imperative programs ought to be a scary nest of computed gotos. What we need is the right higher-level interfaces on top of raw threads and locks.

There generally appears to be a necessary trade-off between flexibility and correctness. A lot of widely known approaches to parallelism maximize one at the expense of the other. Raw threads - and many of the higher-level frameworks out there - are very flexible, but it's very hard to know that your code is correct. With pure functional code, you have statically guaranteed determinism - but "no side effects" is a very severe restriction.

Checkedthreads attempts to offer a different balance: determinism guaranteed by testing instead of statically - and an imperative language to program in, albeit with less synchronization options than those available with raw threads.

I hope you like it - and don't hesitate to email me if you need any sort of support :-) (In particular, I'll gladly port the thing to other platforms/languages if people are interested.)

Is program speed less important than X?

Is program speed less important than safety? Sometimes it is - and sometimes speed is safety. A slow autopilot is a dangerous autopilot. That's why so much safety-critical software is written in the least safe programming languages.

A lot of programs aren't like autopilots - a slower, safer transaction processor is usually better. Google's early credit card charging disaster is my favorite example (it would never happen with a bounds-checked, garbage-collected language runtime).

However, there's the "counter-anecdote" of a cell phone company switching to a fancy new billing system which took more time to charge customers for calls than it took customers to make calls. It fell behind to the point where they had to avoid charging customers for a month worth of calls because they couldn't compute the right amounts. So sometimes trusting one's money to a slow program is rather unsafe.

And then there are high-frequency trading algorithms. Speaking of which: is speed less important than program correctness? Sometimes it is - and sometimes speed is correctness. A slower chess program playing under time control will settle for a worse move - a less correct move. Slower project scheduling software will come up with a worse schedule.

Life in general is a game played under time control. Often, "slower" means "being able to process less information in less ways" - in other words, dumber, further away from "correct".

What about time to market - isn't program speed less important than time to market? Sometimes it is - and sometimes higher speed is shorter time to market. A breathtaking game or special effect on today's hardware that others can only pull off on tomorrow's hardware means that the game or the special effect made it first to the market.

("Performance tricks" sound more relevant to the real-time world of games than to the offline rendering of movies; a great counter-example is procedural generation of outdoor landscape in Brave by Inigo Quilez.)

But time to market is also affected by development time; is program speed less important than development time? Sometimes it is - and sometimes higher speed is less development time. A developer waiting for slow programs develops more slowly - and developers often wait for their own programs (tools searching for stuff or summarizing stuff, build systems, tests, machine learning algorithms, …).

Another point is that a developer whose code tends to be too slow will waste time looking for a faster, fancier, buggier algorithm, sometimes sifting through many options, each of which could be fast enough if coded by someone else. A developer whose code tends to be fast the first time will move on to the next thing more quickly.

Is program speed less important than X? Sometimes it is - but sometimes speed is inseparable from X.

How profilers lie: the cases of gprof and KCachegrind

We'll see how gprof and KCachegrind lie to us and why they do so, discuss the limits within which we can trust them nonetheless, and attempt to draw more general conclusions about profilers and profile visualization tools.

(In case your programming vocabulary is different from mine - I use "lying" in its dispassionate meaning of "communicating falsehoods" and not to convey negative judgement; on the contrary, I'm indebted to the authors of profiling tools both as a user and a developer of such tools.)

So, consider a program with two parts - an easy part and a hard part. Both parts do similar work but one part does much more work than the other:

void work(int n) {
  volatile int i=0; //don't optimize away
  while(i++ < n);
}
void easy() { work(1000); }
void hard() { work(1000*1000*1000); }
int main() { easy(); hard(); }

Here, work() is a do-nothing loop. easy() executes a thousand iterations of that loop and hard() executes a billion iterations. We therefore expect main() to spend most of its time in hard() and only a tiny fraction in easy().

Now let's profile the program with gprof:

gcc -o try try.c -pg
./try # saves stats to gmon.out
gprof try gmon.out

On my machine, this prints the following info:

self            self    total
seconds  calls  s/call  s/call  name
   3.84      2    1.92    1.92  work
   0.00      1    0.00    1.92  easy
   0.00      1    0.00    1.92  hard

gprof's lie is marked in red; it says easy() and hard() took the same amount of time, instead of hard() taking a million times more to run.

What happened? Can we trust anything that gprof says? Which parts of its output are entirely wrong like this "easy() is the same as hard()" business and which parts are roughly correct, give or take a measurement error? To answer this, we need to briefly discuss how gprof works.

Roughly, gprof's two sources of information are profil() and mcount():

  • profil() - a cousin of creat() in that it could have been spelled with an "e" as well - updates an instruction address histogram every 10 milliseconds. That is, 100 times a second the OS looks which instruction the program is executing, and increments a counter corresponding to that instruction. So the share of increments corresponding to a function's body is proportionate to the share of time the program spent in the function.
  • mcount() is a function called by assembly code generated by gcc -pg. Specifically, when a function is entered, it calls mcount() to record a call to itself from the caller (the caller is generally easy to identify because it necessarily passes a return address to your function and that address points right into the caller's body.) So if f() calls g() 152 times, mcount(f,g) will be called 152 times.

With this in mind, we can roughly tell what gprof knows. Specifically, it knows that:

  • easy() and hard() were both called once; work(), called from each, ran twice. This info is from mcount() and it's 100% reliable.
  • The program spent almost no time in the code of easy() and hard(), and most of its time in the code of work(). This info is from profil() and it's rather reliable - because the program ran for >3 seconds, which means we had >300 increments in our instruction histogram. If almost all of these increments are in work(), that's significant enough.

What about the share of time easy() spent in its call to work(), and the share of time hard() spent in work()? By now we know that gprof knows absolutely nothing about this. So it guesses, by taking 3.84 - seconds spent in work(), a reliable number - and divides it between easy() and hard() equally because each called work() once (based on mcount(), a reliable source) - and we get 1.92. This shows how bad results can be produced from perfectly good measurements, if passed to the wrong algorithm.

More generally, gprof's output falls into the following categories, listed in decreasing order of reliability:

  • Number of calls: 100% reliable in all parts of the report. (I think; please correct me if I'm wrong.)
  • Self seconds in the "Flat profile" (time spent in a given function not including children): reliable to the extent that 100 samples/second is statistically significant given the number of hot spots and the total runtime.
  • Seconds attributed to call graph edges (contribution of children to parents, total runtime spent in self+children, etc.): possibly dead wrong. Only trust it if there's zero code reuse in a given case (that is, f() is only called by g()), or if the function in question is known to take about the same time regardless of the call site (for example, rand()).

BTW, the fact that gprof lies doesn't mean that its documentation does; on the contrary, `man gprof` says, in the BUGS section:

The granularity of the sampling is shown, but remains statistical at best. [this refers to the limited reliability of profil's histogram.] We assume that the time for each execution of a function can be expressed by the total time for the function divided by the number of times the function is called. Thus the time propagated along the call graph arcs to the function's parents is directly proportional to the number of times that  arc is traversed. [this refers to the absolutely unreliable way in which profil's "self time" data is combined with mcount's call graph data.]

Unfortunately, users tend to read tools' output without reading documentation. (The ability of users who aren't into profiling tools to understand the implications of this passage is a separate question.)

The man page also refers to papers from 1982 and 1983. An age of over three decades is a good reason to cut a program some slack. In a way, gprof's age is not only a source of its limitations, such as only 100 samples per second, but also a source of its strengths, such as fast execution of profiled code and wide availability.

Now let's look at a more modern profiler called callgrind - a valgrind plugin. Being more modern, callgrind has a few advantages over gprof - such as not lying in its call graph (though some would debate that as we'll see), and coming with a GUI program to visualize its output called KCachegrind.

KCachegrind the viewer (as opposed to callgrind the measurements collector) does lie in its call tree as opposed to call graph as we'll shortly observe. But first let's have a look at its truthful reporting of the situation with easy() being easier than hard():

KCachegrind's call graph - true

As you can see, easy() isn't even shown at the graph (KCachegrind hides things with insignificant cost); you can, however, see the cost of main's call to easy() and hard() at the source view - indeed easy() is ~1000x faster than hard().

Why 1000x and not 1000000x? Because I changed hard() to run a million iterations instead of a billion, bringing the difference down to 1000x. Why did I do that? Because callgrind is slow - it's based on Valgrind which is essentially a processor simulator. This means that you don't measure time - you measure things like instructions fetched and cache misses (which are interesting in their own right), and you get an estimation of the time the program should take given these numbers and your processor model.

It also means callgrind is slow. Is it slower than gprof? Not necessarily. With gprof, code runs at near-native speed, but you only get 100 data points per second. With callgrind you get much more data points per second. So for a hairy program, with callgrind you get statistically significant data more quickly - so effectively callgrind is faster.

But for a simple program with just a couple of hot spots, callgrind is slower because if the program has a costly part 1 and then a costly part 2, it'll take callgrind more time to even get to part 2, whereas gprof, with its near-native speed, will give you good enough data from its fast run.

So much about speed; now let's look at a case where KCachegrind lies to us, and then we'll discuss why it happens.

To expose the lie, we'll need a more complicated program. We'll achieve the required complexity by having two worker functions instead of one, and then adding a manager - a function that does nothing except calling the two workers with the number of iterations to run.

How does the manager decide the number of iterations each worker should run? Based on the project requirements, of course. Our "projects" will be two more functions, each calling the manager with its own pair of iteration numbers for the two workers.

void worker1(int n) {
  volatile int i=0;
  while(i++<n);
}
void worker2(int n) {
  volatile int i=0;
  while(i++<n);
}
void manager(int n1, int n2) {
  worker1(n1);
  worker2(n2);
}
void project1() {
  manager(1000, 1000000);
}
void project2() {
  manager(1000000, 1000);
}
int main() {
  project1();
  project2();
}
As you can see, both workers work on both projects, but each project is mostly done by one of the workers, the other contributing 1000x less work. Now let's see what KCachegrind says about this; we need to run callgrind, which can be done without special compilation flags:
gcc -o try2 try2.c
valgrind –tool=callgrind ./try2
kcachegrind `ls -tr callgrind.out.* | tail -1`
Here's what we'll see:
KCachegrind call tree - false

The bottom part of the screen shows us truths, but the top part shows falsehoods.

The truth is that each project called the manager once; the manager called each worker twice; and each worker did half the work in the program - all shown at the call graph at the bottom. However, at the top, each of the project functions occupies half the window and shows that worker1 and worker2 each did half of the work in each project - which couldn't be further from the truth.

This falsehood is reported by the call tree (or "callee map" as KCachegrind calls it) - a view which is supposed to show, for each function, the share of work done in each of its callees relative to the work done by all those callees together (as opposed to the call graph which only links to the called functions and tells how many times they were called by that caller - and their share of work in the entire program.)

Why does the call tree tell a falsehood? Again, because it doesn't know the truth. KCachegrind visualizes callgrind's measurements, which include the number of times f() called g() and the time spent in calls from f() to g().

This is more that gprof's information - way more. gprof only knows how much time was spent in f() and g(), and how many times f() called g(). Callgrind also measures how much time was spent in g() specifically when it was called from f(). In particular, this means that KCachegrind's source view gives a perfectly reliable measurement of the time spent in f plus all its callees - something which users take for granted and something which gprof only guesses, often wildly wrongly.

However, this information is not enough to know what the call tree needs knowing to show the truth. Specifically, you only know that manager() spent the same time in calls to worker1() and worker2() overall; you have no idea how much time it spent in each worker when called from project1() and project2(). So you can't reliably plot the share of time worker1() and worker2() spent inside project1() or project2(); you can only guess, often wildly wrongly.

(In fact, you can't tell if manager() even called worker1() when called from project1(); perhaps it didn't - all you know is that manager called worker1() in some context. Some people conclude that the call graph is "incorrect"; in fact it is correct, the question is if you understand what you see the way you're supposed to - you aren't supposed to think that every path through the graph actually happened, only every edge. Another question is how upset you are when you find out (someone with a lot of "manager" functions doing nothing but dispatching might be very upset.) This example certainly broadens the gray area between "truth" and "lies" in profilers' output.)

Callgrind has something which appears like a possible workaround: --separate-callers=N. What this does is it measures things per call stack of size N instead of per call "arc". That is, instead of having a measurement for the arc from manager() to worker1() and a measurement for manager()->worker2(), it measures separately for project1()->manager()->worker1(), project1()->manager()->worker2(), project2()->manager()->worker1(), etc.

Unfortunately, KCachegrind didn't manage to open the resulting output file on my machine, nor did it help when I replaced the ticks (') separating the function names (which get concatenated together) with underscores (_). However, the textual data produced by callgrind indeed shows the truth:

fn=(726) worker2'manager'project1
5 3
+1 1
+1 7000008
+1 2

fn=(736) worker2'manager'project2
5 3
+1 1
+1 7008
+1 2

This shows that worker2() did 1000x more work when called (through manager()) from project1() than it did when called from project2() - as we know it did.

Having looked into the details of two particular examples, we'll proceed to a more general discussion of profilers and profile visualization tools.

Using no profiler at all is better than hoping it'll save the day

I know a few people who like to optimize code and think optimization is important, and who mostly ignore profilers. I know a few other people who claim that a good profiler is the necessary and sufficient prerequisite for optimization. More often than not, such people are not particularly fond of optimization, and their code will be slower than the code of above-mentioned profiler-bashers, before as well as after optimization.

The examples above supposedly show a part of the reason why "a good profiler" is not at all trivial to use.

Basically among the many opinions, there are two extreme ones sounding along the lines of:

  • You don't know where your bottlenecks are going to be, therefore don't bother to optimize before measuring.
  • You don't know where your bottlenecks are going to be - nor will you be able to measure them because adequate measurement and analysis tools plus test scenarios covering the relevant possibilities are hard to come by. Therefore, conserve resources if there's even a slight chance that the code will be a bottleneck.

In my experience, the first viewpoint results in slower code and is less consistent internally. That is, for someone who's not into optimization, a profiler is just not the force multiplier that he thinks it is. That's because he won't put the mental effort required to make an effective use of the profiler for the same reasons making him write slow code in the first place: he just doesn't like all this performance stuff.

The trouble with this being that profilers are not things magically telling you what to do without concentration on your behalf.

You need to understand how a profiler works to use it well

When I first realized how gprof lies in its call graph, I was so offended that I stopped using it for a long while. The problem with profilers is that not all the data is gross lies, but some is, and without knowing which data is likely to be wrong, you might trust things that you shouldn't, scratch your head to the point of hair loss, and then abandon the tool altogether once you realize how it routinely betrays your trust.

With KCachegrind, I came up with the example where it lies based on my knowledge of the callgrind output format - something I know because we (specifically, GK) added support for that format to our in-house profiling tools at work. I wouldn't guess that the Callee Map view is unreliable otherwise. Like most users, I rarely read the docs unless something clearly refuses to work altogether. The stats in the call graph as well as the source view are perfectly reliable. How would I know some other stats aren't?

The extent to which you warn the user about possible implications of assumptions in your software is a tough question for all programmers. Should KCachegrind have a big red warning "I might be misleading you" at its "map" views? A "true power user" doesn't need the warning because they know how the tool works anyway. A "truly careless user" wouldn't read the explanation linked to from the red warning and would just be scared away. Only a "middling user" irresponsible enough to not read the docs but responsible enough to read text linked to from big red warnings might benefit from such design. But how many such users are there compared to the rest?

My own experience here is depressing - nobody, not even the smartest folks, is willing to read anything unless they came here to read. As in, when you're a tutorial that people intend to read, you can tell things to people and they listen. When you're an error message, people read you to the extent necessary to make you go away. And when you're a warning, people ignore you. It sucks to be a warning.

So I'm pessimistic about big read warnings. As to tutorials - people don't expect profilers to be complicated enough to warrant a tutorial, so they probably won't allocate time specifically to read one. They're wrong, but they won't.

Choosing a profiler is hard

There's no perfect profiler - it's a tradeoff, or rather a bunch of tradeoffs. Let's see how complicated those tradeoffs are by comparing gprof, callgrind and Google's CPU profiler:

  • gprof is fast, requires a special compilation, gives you "self time" based on 100 instruction address samples per second, precise call counts, and often bogus "children time" information.
  • callgrind is slow, requires no special compilation, gives time estimations based on event counting, several order of magnitudes more data points per second (making it "effectively faster" for some use cases but not others), precise call counts as well as precise "events counted during a call to each child" information, and comes with a viewer giving correct though possibly misleading call graph and often bogus "map" views.
  • Google's CPU profiler is fast, requires no special compilation unless you're on a 64b platform without a working unwind library, uses a configurable amount of samples per second (default: the measly 100, I wonder why), lacks precise call count information, logs full call stacks unlike gprof and callgrind without –separate-callers, but then converts data to many viewing formats where the extra info is lost (such as the callgrind format). You do get more informative view of the profiling data in some cases.

So basically each of these profilers is better than the rest in some way and worse in some other way (for instance, gprof, at first glance the awful, ancient profiler, is actually your shortest path to precise call counts, which may well be the thing you need the most). And that's before discussing things like the ease of getting a working version for your platform.

As it often is with complicated things, the choice is made harder by the fact that you don't realize many of the implications until after you gained some experience with the tool. I don't think I know what questions to ask about a new profiler, even though I'm relatively savvy. I do realize that I want to understand how it works in a lot of detail.

Not all errors are "noise"

If you're listening to an analogue recording and there's a "shhhh" sound, it's "noise". If, however, someone is yelling near you, then this louder noise isn't "noise" in the mathematical sense - it's another signal. And if someone has overwritten your original recording, then there's only another signal left and none of yours. There's also noise on top of that other signal, but that noise is the least of your troubles.

Not everything standing between you and your signal is "noise"; sometimes it's an error making you look at the wrong signal.

With profilers, people intuitively expect "measurement noise" - a profiler measures time (or some other cost) and no measurement device is perfect. People are willing to accept that - but they do expect fidelity in the assignment of costs to context. "Context" is source lines, functions, and call sequences; people (correctly) think of context as something logical where the concept of "measurement errors" doesn't apply; it's either correct or not.

Unfortunately, people are right, but the conclusion is the opposite of the natural expectation: "context" is indeed a logical concept and not a continuous quantity - and therefore, when the tools give you wrong information, it's really wrong, as in completely detached from reality, and not something "roughly right" give or take a measurement error.

(Our examples focusing on call sequences are rather benign, in a way; for real fun, we could discuss the fidelity of assigning costs to source code lines in optimized programs. Even with a nice assembly listing linking to the various source files from which code was inlined, like the listing provided by KCachegrind, it's… let's say less than easy.)

The upshot is that many errors in profilers' output are not something that can be fixed by running the program more times to get more statistically significant results. Rather, you have to know where the results are only distorted by noise and where there can also be logical errors rendering the numbers meaningless.

Presentation errors can exist without measurement errors

…As evidenced by KCachegrind's false presentation of callgrind's or Google profiler's correct results, or by gprof's false conclusions based on reasonable numbers reported by profil() and perfect numbers from mcount().

To add to the previous point, measurement errors are more likely to be "noise" than presentation errors, which are more likely to tell something unrelated to the truth.

Conclusion

Profiling is trickier than we tend to assume, and as someone developing profilers, I understand programmers who're good at optimization and who mostly ignore profilers. A profiler could help users get way more mileage if it found a way to convince them to actually read a thorough, memorable explanation of its strengths and weaknesses.