Working simultaneously vs waiting simultaneously

"Multiprocessing", "multi-threading", "parallelism", "concurrency" etc. etc. can give you two kinds of benefits:

  • Doing many things at once - 1000 multiplications every cycle.
  • Waiting for many things at once - wait for 1000 HTTP requests just issued.

Some systems help with one of these but not the other, so you want to know which one - and if it's the one you need.

For instance, CPython has the infamous GIL - global interpreter lock. To what extent does the GIL render CPython "useless on multiple cores"?

  • Indeed you can hardly do many things at once - not in a single pure Python process. One thread doing something takes the GIL and the other thread waits.
  • You can however wait for many things at once just fine - for example, using the multiprocessing module (pool.map), or you could spawn your own thread pool to do the same. Many Python threads can concurrently issue system calls that wait for data - reading from TCP sockets, etc. Then instead of 1000 request-wait, request-wait steps, you issue 1000 requests and wait for them all simultaneously. Could be close to a 1000x speed-up for long waits (with a 1000-thread worker pool; more on that below). Works like a charm.

So GIL is not a problem for "simultaneous waiting" for I/O. Is GIL a problem for simultaneous processing? If you ask me - no, because:

  • If you want performance, it's kinda funny to use pure Python and then mourn the fact that you can't run, on 8 cores, Python code that's 30-50x slower than C to begin with.
  • On the other hand, if you use C bindings, then the C code could use multiple threads actually running on multiple cores just fine; numpy does it if properly configured, for instance. Numpy also uses SIMD/vector instructions (SSE etc.) - another kind of "doing many things at once" that pure Python can't do regardless of the GIL.

So IMO Python doesn't have as bad a story in this department as it's reputed to have - and if it does look bad to you, you probably can't tolerate Python's slowness doing one thing at a time in the first place.

So Python - or C, for that matter - is OK for simultaneous waiting, but is it great? Probably not as great as Go or Erlang - which let you wait in parallel for millions of things. How do they do it? Cheap context management.

Context management is a big challenge of waiting for many things at once. If you wait for a million things, you need a million sets of variables keeping track of what exactly you're waiting for (has the header arrived? then I'm waiting for the query. has it arrived? then I ask the database and wait for it etc. etc.)

If those variables are thread-local variables in a million threads, then you run into one of the problems with C - and hence OS-supported threads designed to run C. The problem is that C has no idea how much stack it's gonna need (because of the halting problem, so you can't blame C); and C has no mechanism to detect that it ran out of stack space at runtime and allocate some more (because that's how its ABIs have evolved; in theory C could do this, but it doesn't.)

So the best thing a Unixy OS could do is, give C one page for the stack (say 4K), and make say the next 1-2M of the virtual address space unaccessible (with 64b pointers, address space is cheap). When C page-faults upon stack overflow, give it more physical memory - say another 4K. This method means at least 4K of allocated physical memory per thread, or 4G for a million threads - rather wasteful. (I think in practice it's usually way worse.) All regardless of us often needing a fraction of that memory for the actual state.

And that's before we got to the cost of context switching - which can be made smaller if we use setjmp/longjmp-based coroutines or something similar, but that wouldn't help much with stack space. C's lax approach to stack management - which is the way it is to shave a few cycles off the function call cost - can thus make C terribly inefficient in terms of memory footprint (speed vs space is generally a common trade-off - it's just a bad one in the specific use case of "massive waiting" in C).

So Go/Erlang don't rely on the C-ish OS threads but roll their own - based on their stack management, which doesn't require a contiguous block of addresses. And AFAIK you really can't get readable and efficient "massive waiting" code in any other way - your alternatives, apart from the readable but inefficient threads, are:

  • Manual state machine management - yuck
  • Layered state machines as in Twisted - better, but you still have callbacks looking at state variables
  • Continuation passing as in Node.js - perhaps nicer still, but still far from the smoothness of threads/processes/coroutines

The old Node.js slides say that "green threads/coroutines can improve the situation dramatically, but there is still machinery involved". I'm not sure how that machinery - the machinery in Go or Erlang - is any worse than the machinery involved in continuation passing and event loops (unless the argument is about compatibility more than efficiency - in which case machinery seems to me a surprising choice of words.)

Millions of cheap threads or whatever you call them are exciting if you wait for many events. Are they exciting if you do many things at once? No; C threads are just fine - and C is faster to begin with. You likely don't want to use threads directly - it's ugly - but you can multiplex tasks onto threads easily enough.

A "task" doesn't need to have its own context - it's just a function bound to its arguments. When a worker thread is out of work, it grabs the task out of a queue and runs it to completion. Because the machine works - rather than waits - you don't have the problems with stack management created by waiting. You only wait when there's no more work, but never in the middle of things.

So a thread pool running millions of tasks doesn't need a million threads. It can be a thread per core, maybe more if you have some waiting - say, if you wait for stuff offloaded to a GPU/DSP.

I really don't understand how Joe Armstrong could say Erlang is faster than C on multiple cores, or things to that effect, with examples involving image processing - instead of event handling which is where Erlang can be said to be more efficient.

Finally, a hardware-level example - which kind of hardware is good at simultaneous working, and which is good at simultaneous waiting?

If your goal is parallelizing work, eventually you'll deteriorate to SIMD. SIMD is great because there's just one "manager" - instruction sequencer - for many "workers" - ALUs. CPUs, DSPs and GPUs all have SIMD. NVIDIA calls its ALUs "cores" and 16-32 ALUs running the same instruction "threads", but that's just shameless marketing. A "thread" implies, to everyone but a marketeer, independent control flow, while GPU "threads" march in lockstep.

In practice, SIMD is hard despite thousands of man-years having been invested into better languages and libraries - because telling a bunch of dumb soldiers marching in lockstep what to do is just harder than running a bunch of self-motivating threads each doing its own thing.

(Harder in one way, easier in another: marching in lockstep precludes races - non-deterministic, once-in-a-blue-moon, scary races. But races of the kind arising between worker threads can be almost completely remedied with tools. Managing the dumb ALUs can not be made easier with tools and libraries to the same extent - not even close. Where I work, roughly there's an entire team responsible for SIMD programming, while threading is mostly automatic and bugs are weeded out by automated testing.)

If, however, you expect to be waiting much of the time - for memory or for high-latency floating point operations, for instance - then hoards of hardware threads lacking their own ALUs, as in barrel threading or hyper-threading, can be a great idea, while SIMD might do nothing for you. Similarly, a bunch of weaker cores can be better than a smaller number of stronger cores. The point being, what you really need here is a cheap way to keep context and switch between contexts, while actually doing a lot at once is unlikely to be possible in the first place.

Conclusions

  • Doing little or nothing while waiting for many things is both surprisingly useful and surprisingly hard (which took me way too long to internalize both in my hardware-related and software/server-related work). It motivates things looking rather strange, such as "green threads" and hardware threads without their own ALUs.
  • Actually doing many things in parallel - to me the more "obviously useful" thing - is difficult in an entirely different way. It tends to drag in ugly languages, intrinsics, libraries etc. about as much as having to do one single thing quickly. The "parallelism" part is actually the simplest (few threads so easy context management; races either non-existent [SIMD] or very easy to weed out [worker pool running tasks])
  • People doing servers (which wait a lot) and people doing number-crunching (work) think very differently about these things. Transplanting experience/advice from one area to the other can lead to nonsensical conclusions.

See also

Parallelism and concurrency need different tools - expands on the reasons for races being easy to find in computational code - but impossible to even uniformly define for most event handling code.

Can your static type system handle linear algebra?

It seems a good idea to use a static type system to enforce unit correctness - to make sure that we don't add meters to inches or some such. After all, spacecraft has been lost due to such bugs costing hundreds of millions.

Almost any static type system lets us check units in sufficiently simple cases. I'm going to talk about one specific case which seems to me very hard and I'm not sure there's a type system that handles it.

Suppose we have a bunch of points - real-world measurements in meters - that should all lie on a straight line, modulo measurement noise. Our goal is to find that line - A and B in the equation:

A*x + B = y

If we had exactly 2 points, A and B would be unique and we could obtain them by solving the equation system:

A*x1 + B = y1
A*x2 + B = y2

2 equations, 2 unknowns - we're all set. With more than 2 measurements however, we have an overdetermined equation system - that is, the 3rd equation given by the measurement x3, y3 may "disagree" with the first two. So our A and B must be a compromise between all the equations.

The "disagreement" of x,y with A and B is |A*x + B - y|, since if A and B fit a given x,y pair perfectly, A*x+B is exactly y. Minimizing the sum of these distances from each y sounds like what we want.

Unfortunately, we can't get what we want because minimizing functions with absolute values in them cannot be done analytically. The analytical minimum is where the derivative is 0 but you can't take the derivative of abs.

So instead we'll minimize the sum of (A*x + B - y)^2 - not exactly what we want, but at least errors are non-negative (good), they grow when the distance grows (good), and this is what everybody else is doing in these cases (excellent). If the cheating makes us uneasy, we can try RANSAC later, or maybe sticking our head in the sand or something.

What's the minimum of our error function? Math says, let's first write our equations in matrix form:

(x1 1)*(A B) = y1
(x2 1)*(A B) = y2
...
(xN 1)*(A B) = yN

Let's call the matrix of all (xi 1) "X" and let's call the vector of all yi "Y". Then solving the equation system X'*X*(A B) = X'*Y gives us A, B minimizing our error function. Matlab's "\" operator - as in X\Y - even does this automatically for overdetermined systems; that's how you "solve" them.

And now to static type systems. We start out with the following types:

  • x,y are measured in meters (m).
  • A is a number (n)- a scale factor - while B is in meters, so that A*x + B is in meters as y should be.

What about X'*X and X'*Y? X'*X looks like this:

sum(xi*xi) sum(xi) //m^2 m
sum(xi)    sum(1)  //m   n

….while X'*Y looks like this:

sum(xi*yi) sum(yi) //m^2 m

The good news is that when we solve for A and B using Cramer's rule which involves computing determinants, things work out wonderfully - A indeed comes out in meters and B comes out as a plain number. So if we code this manually - compute X'*X and X'*Y by summing the terms for all our x,y points and then hand-code the determinants computation - that could rather easily be made to work in many static type systems.

The bad news is that thinking about the type of Matlab's \ operator or some other generic least-squares fitting function makes my head bleed.

Imagine that large equation systems will not be solved by Cramer's rule but by iterative methods and all the intermediate results will need to have a type.

Imagine that we may be looking, not for a line, but a parabola. Then our X'*X matrix would have sum(x^4), sum(x^3) etc. in it - and while spelling out Cramer's rule works nicely again, a generic function's input and output types would have to account for this possibility.

Perhaps the sanest approach is we type every column of X and Y separately, and every element of X'*X and X'*Y separately - not meters^something but T1, T2, T3… And then we see what comes out, and whether intermediate results have sensible types. But I just can't imagine the monstrous types of products of the various elements ever eventually cancelling out as nicely as they apparently should - not in a real flesh-and-blood programming language.

Maybe I'm missing something here? I will admit that I sort of gave up on elegance in programming in general and static typing in particular rather long ago. But we have many static type systems claiming to be very powerful and concise these days. Maybe if I did some catching-up I would regain the optimism of years long gone.

Perhaps in some of these systems, there's an elegant solution to the problem of \'s type - more elegant than just casting everything to "number" when it goes into \ and casting the result back into whatever types we declared for the result, forgoing the automatic type checking.

So - can your static type system do linear algebra while checking that measurement units are used consistently?

C++11 FQA anyone?

isocpp.org announced "the C++ FAQ" to which reportedly Bjarne Stroustrup and Marshall Cline will link to help boost its search rank. Good stuff; also, given that the language is >30 years old - it's about time.

Which reminds me. I'm failing to keep my promise to update the C++ FQA, an overly combative reply to Marshall Cline's fairly combative FAQ, to the new standard from hell - C++11.

Could you perhaps do it instead of me? Unfortunately, explaining how terrible C++ is cannot be made into a career the way explaining how wonderful C++ is can be. So I sort of moved on to other things; I managed to summon the mental energy just that one time to write the C++ FQA, and even as I wrote it I knew that I better hurry because that energy was waning. After all, at work I managed to reduce the amount of C++ I deal with to a minimum, so there was neither a profit motive nor continuous frustration to recharge me.

Not that I haven't made money off the C++ FQA - indirectly I did make money, specifically by getting a headhunter's attention and significantly improving my employment conditions. But I don't think a new FQA's maintainer/co-author could count on something along those lines.

Still, you'd become about as widely famous in narrow circles as I am, getting maybe 200K unique visitors per year and hundreds of "thank you" emails. Most importantly, you'd do the world a great service.

There's a curious double standard in the world of programming languages. Say, PHP is widely ridiculed because, for instance, its string comparison operator converts strings to numbers if they start with "0e" and this results in unexpected behavior. And it doesn't help PHP that === works just fine.

Imagine someone mocking C++ for two char pointers comparing "wrong" with ==. Immediately they'd be told that casting to std::string (more verbose than ===) would work just fine, and that "you just don't get it".

Why is PHP - a language full of quirks which at least gives you memory and type safety - is universally ridiculed and the developers, while defending the language, never ridicule back, while C++, an absolutely insane language, is ridiculed often enough but C++ developers always counter-attack viciously? For that matter, why isn't Lisp ridiculed for EQ, EQL, EQUAL and EQUALP, if comparison operators are so funny?

The reason, IMO, is simple: PHP is not taught in academic CS courses. C++ developers are much more likely to have a CS degree, therefore both they and others treat their knowledge of crazy C++ arcana as something of intellectual value. And Lisp is the poster child of academic programming language development. Educated proponents deter attempts at ridiculing a language.

In a "rational" universe, PHP would be held in high esteem given the amount of output produced by PHP developers with little training. In our universe we have this double- or triple-standard. Pointing out the darker aspects of C++ - which are most of its aspects - thus increases the supply of a rather scarce commodity.

But why C++ and not, say, Lisp, Haskell or C#?

One reason is that C++ is arguably the craziest language in widespread use, combining the safety of C (and I can live with C just fine) with the clarity of Perl (and I can live with Perl peacefully enough). But it is of course subjective.

The thing that really sets C++ apart is its development culture and value system - a perverse amalgamation of down-to-earth shrewdness and idealistic perfectionism. The whole idea of making the best, richest, most efficient and most generic/versatile programming language of the planet - you can sense that Bjarne Stroustrup aims at nothing less - ON TOP OF C is the perfect illustration of this culture, and perhaps its origin.

Stroustrup always knew that the language would be better if it didn't have to be compatible with C and he publicly acknowledged it - the "smaller, much simpler language struggling to come out" remark. But he also knew that using an Embrace, Extend and Exterminate strategy is a much more likely way to succeed. So he did that. In fact he managed to more or less kill C or at least put it in a coma - as even C99, not to mention C11, will never be supported by the Microsoft compiler, and C89 is a really old and really restrictive standard.

A shrewd move - almost an evilly shrewd one - netting the people behind C++ fame and fortune at the expense of programmers dealing with a uniquely dangerous language full of sugar-coated death traps.

(Yes, sugar-coated death traps, you clueless cheerleaders. X& obj=a.b().c() - oops, b() is a temporary object and c() returns a reference into it! Shouldn't have assigned that to a reference. Not many chances for a compiler warning, either.)

Who did things differently? Sun and Microsoft, for example, marketing Java and C#, respectively. They made much cleaner languages from scratch, and to solve the chicken-and-egg problem - we have no legacy projects hence no programmers hence no new projects hence no programmers - they used money, large marketing budgets and large budgets for creating large standard libraries. A much more honest approach yielding much better results, I find.

And Stroustrup says about himself that he "lacks marketing clout" and says Java and C# are bad for you because they're "platforms, not languages", whatever that means.

And I'm not claiming to be able to read minds, but if I had to bet - I'd say he really believes that. He probably thinks he's your altruistic benefactor and Java and C# are evil attempts to drag you into proprietary platforms.

And you can see the same shrewdness, the same "altruism", the same attention to detail, the same tunnel vision - "this shit I'm working on is so important, it deserves all of my mental energy AND the mental energy of my users at the expense of caring about anything else" - throughout the C++ culture. From the boost libraries to "Modern C++ Design" (the author has since repented and moved to D - or did he repent?..) to the justifications for duplicate and triplicate and still incomplete language features to your local C++ expert carrying his crazy libraries and syntactic wrappers and tangling your entire code base in his net.

And this approach to life - "altruism" plus perfectionism plus cleverness plus shrewdness - extends way beyond C++ and way beyond programming. And the alternative is taming your ambitions.

This was my larger purpose in writing about all this shit. I'm pretty sure I failed in the sense that it got drowned in C++ error messages and other shits and giggles.

So maybe you'll do better than me. If you do, you might contribute to the sanity of many a young idealistic programmer - these tend to get sucked into C++'s sphere of influence, either emerging old and embittered years later or lost to sanity forever, stuck in an internally consistent but absolutely crazy way of thinking.

BTW I never wanted it to be a personal attack, in the sense that (1) I don't know what's inside the heads of people who promote C++ and (2) I can tell you with certainty that if I made something 10 times as bad as C++ and it was 0.0001x as popular as C++, I'd be immensely proud of myself and I wouldn't give a damn about what anyone thought. Just like I don't give a damn about people with so much time on their hands to actually have thousands of karma points at StackOverflow explaining just how lame C++ FQA is.

In this sense, C++ is fine and a worthy achievement of a lifetime. Especially in a world where Putin is a candidate for a Nobel Peace prize and Obama already got one.

I basically just think that (1) C++'s horrible quirks are worth pointing out and (2) there's something to be learned from this story about the way we pave the road to hell with our good intentions. That's all.

***

A lot of people have spoken about "a C++ renaissance" when C++11 was ratified. I tend to agree - indeed the new standard is a fresh doze of the same thing that C++ always was: "zero-overhead" pretty-looking syntax with semantics quite horrendous once you think what it actually means.

Fittingly for a new revision of the C++ standard, more things we could once safely count on have gone with the wind. For instance, anything you could pass to a function used to be an expression, and expressions had one and only one type. How ironic that this revision, the revision that finally capitalized on this fact by introducing auto, also made this fact no longer true: {0} could be an int or an std::initializer_list, depending on the context. Context-dependent types were one thing that Perl had (scalar/vector context) but C++ didn't have.

(I have observed someone do this: _myarr[5]={0}; - they had in the .h file the definition int _myarr[5] and they remembered that this thing could be initialized with {0} in other contexts. What they did wouldn't compile in C++98; in C++11 it promptly assigned the int 0 to the non-existent 5th element of _myarr, and the usual hilarity ensued. Imagine how PHP would be ridiculed for this kind of little behavior - and PHP at least would never overwrite an unrelated variable with garbage. Imagine how with C++, the poor programmer will be ridiculed instead.)

This is nothing, BTW - it's not the kind of thing the FQA would normally poke fun at. We have bigger fish to fry. For instance, C++11's "closures" or "lambdas" - the preposterous thing where you say [](){…} and an anonymous struct gets generated. BTW it's one of the reasons I urged everyone where I work to upgrade to C++11; because it lets you implement a nicely-looking parallel_for. In C I would have added a compiler extension for it AGES AGO but extending the C++ grammar? No sir. I waited patiently until C++11 lambdas.

So, C++11 lambdas. Someone needs to write everything about those hideous capture lists - are references captured by value or by reference?! - and how you can end up passing dangling references if this closure thingie outlives variables it references (we don't need no stinking garbage collection! but why doesn't C++11 let one capture variables by std::shared_ptr?.. It's so much better than gc!), and about the type of this shit and how it looks in debuggers, and about std::function etc. etc.

And this won't be me because frankly, I don't have time to even fully master this arcana anymore.

If you dislike C++ and have the time to write about it, I'll gladly pass the torch on to you; specifically I'll redirect C++ FQA to point to your site, following the example of Bjarne Stroustrup and Marshall Cline. Or we could run a wiki, or something. Email or comment if you're interested.

A simple way to "get more people to code"

Disclaimers

  • I'm not a citizen of a country where "getting more people to code" is a widely shared concern. While my proposal seems to me like a simple solution to a simple problem, this is merely an outsider's observation.
  • Recently the tide of calls to "code" seems to have subsided. If it's no longer a pressing social problem, I regret being late with my excellent solution.
  • Like all brilliant ideas, my solution is simple, and I was surprised by not seeing it proposed in the voluminous discussions of the topic. I apologize to those who've proposed the same thing earlier without me noticing for not giving them proper credit.

The problem

For a long time now, people have been voicing a concern about a supposed shortage of computer programmers. Many also express narrower concerns, such as a shortage of children, women or people with a certain skin color in the ranks of computer programmers.

A high-profile example is US President Barack Obama. This US President went further than urging others to "code" and regretted not being able to do so himself: "I wanted to go in and fix <healthcare.gov> myself, but I don't write code."

(I understand President Obama. I once wanted to go in and fix international politics. Then I realized that I don't shoot bullets.)

People from technology companies have also urged others to learn programming - for example, Mark Zuckerberg and Bill Gates.

Finally, individuals such as NBA star Chris Bosh encouraged people to learn to code in their private capacity.

Even this humble blogger was asked to give an interview by someone interested to encourage people in general and women in particular to code! And you know why? Surprisingly, because of a piece I wrote where I told that it was money that attracted me to programming, and it is money that keeps me in programming.

It was then when a solution to the problem at hand popped to my mind.

A possible solution

If there's a shortage of programmers, we could pay programmers more money.

How can this be arranged, and how would it solve the problem? There are several possibilities.

For example, companies such as Google, Apple and Intel could raise wages, causing some people to choose programming over other occupations. Once a sufficient amount of people are attracted to programming, wages would fall.

Alternatively, governments such as that headed by Barack Obama could lower programmers' income tax, or introduce a negative tax for programming. Again rising wages would attract more people to programming. A government convinced that the number of programmers reached a high enough level could then cancel the subsidy.

Finally, concerned individuals such as the wealthy basketball player Chris Bosh could donate money to computer programmers until enough people are attracted to the profession.

Having outlined a solution to the general problem, we now direct our attention to the narrower concerns of representation of particular groups of people in computer programming.

Discrimination affirmative action

Companies could increase the representation of younger people in programming by paying extra money to programmers below a certain age.

Similarly, governments could give tax breaks to younger programmers, or to parents of children who regularly submit their code for review by public officials.

Finally, a trust fund could be set up paying children to learn to program.

A similar approach should in principle be applicable to the problems of increasing the representation of women, people of particular races or other groups.

In some cases, questions of legality arise; for instance, while a trust fund for the benefit of people of one race but not others is probably legal, for-profit companies and governments might not be able to legally discriminate based on race.

However, this problem could be overcome, either by changing the laws or by setting aside a sum of money equal in size to the discriminatory subsidy paid to the members of the group in question. Once enough people from that group enter the programming profession, that sum of money can be divided between working programmers not belonging to the group, in effect undoing the discrimination.

Or something along these lines.

Conclusion

If this strikes you as absurd, does it strike you as even more absurd that people claim something to be a problem when its "solution" is as obvious as it is ridiculous? (Or is it really that ridiculous? Farm subsidies exist. Why not FarmVille subsidies?)

If people don't "learn to code", maybe the option is insufficiently attractive to those who can, given current wages, lifetime employment prospects, and the complete uselessness of the skill outside work.[1]

If companies don't raise wages, maybe they don't feel a very pressing need for more programmers. (Of course they'd still "encourage" people to program - as long as the cost of encouragement is likely to be offset by a fall of wages, once a surplus of programmers results from the encouragement.)

If a situation persists, maybe there's a good reason for it.

[1] It is not completely fair to deny programming its uses outside work. A more fair presentation is an analogy with today's hobbyist 3D printing. An owner of a 3D printer recently told me that "having one really exposes the impotence of… not having one. For instance, I needed this little thingie to hold a shelf. Took 30 minutes to design and print. And where would I get it otherwise?!" The answer, of course, is "at a nearby store" where they have a box full of these thingies at about 20 cents apiece. Of course, in a couple thousand years, his investment in the 3D printer will repay itself though the continuous printing of thingies.

Further reading

Women in Science by Philip Greenspun is a must-read for anyone interested in increasing the representation of group X in occupation Y:

Adjusted for IQ, quantitative skills, and working hours, jobs in science are the lowest paid in the United States.

This article explores this … possible explanation for the dearth of women in science: They found better jobs.

Very funny, gdb. Ve-ery funny.

Have you ever opened a core dump with gdb, tried to print a C++ std::vector element, and got the following?

(gdb) p v[0]
You can't do that without a process to debug.

So after seeing this for years, my thoughts traveled along the path of, we could make a process out of the core dump.

No really, there used to be Unices with a program called undump that did just that. All you need to do is take the (say) ELF core dump file and generate an ELF executable file which loads the memory image saved in the core (that's actually the easier, portable part) and initializes registers to the right values (the harder, less portable part). I even wrote a limited version of undump for PowerPC once.

So we expended some effort on it at work.

And then I thought I'd just check a live C++ process (which I normally don't do, for various reasons). Let's print a vector element:

(gdb) p v[0]
Could not find operator[].
(gdb) p v.at(0)
Cannot evaluate function — may be inlined.

Very funny, gdb. "You can't do that without a process to debug". Well, I guess you never did say that I could do that with a process to debug, now did you. Because, sure enough, I can't. Rolling on the floor, laughing. Ahem.

I suggest that we all ditch our evil C arrays and switch to slow-compiling, still-not-boundary-checked, still-not-working-in-debuggers-after-all-these-YEARS std::vector, std::array and any of the other zillion "improvements".

And gdb has these pretty printers which, if installed correctly (not easy with several gcc/STL versions around), can display std::vector - as in all of its 10000 elements, if that's how many elements it has. But they still don't let you print vec[0].member.vec2[5].member2. Sheesh!

P.S. undump could be useful for other things, say a nice sort of obfuscating scripting language compiler - Perl used to use undump for that AFAIK. And undump would in fact let you call functions in core dumps - if said functions could be, um, found by gdb. Still, ouch.

P.P.S. What gdb prints and when depends on things I do not comprehend. I failed to reproduce the reported behavior in full at home. I've seen it for years at work though.

Delayed printf for real-time logging

The article is at embeddedrelated.com. The nice bit is, you save format string pointers to files, and you read the strings later directly from the executable, and then do the formatting. There are a few tricks related to scripting gdb in Python and other such things. The upshot is that you get to do free-form text logging from virtually any context - including things like interrupt handlers, etc.

Coroutines in one page of C

I've written a single page implementation of coroutines in C using setjmp/longjmp and just a little bit of inline assembly. The article is at embeddedrelated.com. Here's an example C program using coroutines - equivalent to the Python code:

def iterate(max_x, max_y):
  for x in range(max_x):
    for y in range(max_y):
      yield x,y

for x,y in iterate(2,2):
  print x,y

In C:

#include <stdio.h>
#include "coroutine.h"

typedef struct {
  coroutine* c;
  int max_x, max_y;
  int x, y;
} iter;

void iterate(void* p) {
  iter* it = (iter*)p;
  int x,y;
  for(x=0; x<it->max_x; x++) {
    for(y=0; y<it->max_y; y++) {
      it->x = x;
      it->y = y;
      yield(it->c);
    }
  }
}

#define N 1024

int main() {
  coroutine c;
  int stack[N];
  iter it = {&c, 3, 2};
  start(&c, &iterate, &it, stack+N);
  while(next(&c)) {
    printf("x=%d y=%d\n", it.x, it.y);
  }
}

Yeah it's a bit longer, or perhaps more than a bit longer. But it's still very useful at times. Check it out!

Do call yourself a programmer, and other career advice

This is a (very late) reply to Patrick McKenzie's "Don't Call Yourself A Programmer, And Other Career Advice". I find much of his advice very sensible, and it might be very helpful to someone in the beginning of their career - assuming they can act upon it (and I really don't know whether my 20-year-old self could actually use the advice to improve his negotiation skills, for example).

A few things in the article I disagree with, however. Here I'll mostly focus on those few things, recommending you to read the original article so that you don't miss the rest of it.

"Disagree" is not necessarily the right word - a more precise way to put it would be "it's different in my experience". Which is to be expected because both of us are speaking based on our own careers, which have been rather different. Patrick McKenzie is a small business owner running Bingo Card Creator and a successful consultant. I'm a lead chip architect at a billion-dollar company. Both of us have thus traveled some distance away from "purely programming" (whatever that means), but in rather different directions.

What company are you going to work for?

Patrick McKenzie says 90% of the jobs involve things like implementing an internal travel expense reporting form, rather than a product shipped to external customers. He advises you to get used to the idea, even though such software is "soul-crushingly boring" as he puts it.

How bad is it, and is it really 90% of the jobs? Spolsky thinks it's maybe 80% - and that it's bad enough to "drain the life out of you". He goes on to elaborate why it "sucks to be an in-house programmer":

  • There's rarely a business reason to improve in-house software past the point of "barely good enough". "Forget any pride of craftsmanship - you're going to churn out embarrassing junk".
  • At software companies, what you do is more directly related to the way the company makes money, so you're more likely to be respected. "A programmer is never going to rise to become CEO of Viacom, but you might well rise to become CEO of a tech company." "…no matter how critical it was for Viacom to get this internet thing right, when it came time to assign people to desks, the in-house programmers were stuck with 3 people per cubicle in a dark part of the office".

Note that McKenzie and Spolsky are in almost complete agreement over these points. But then Spolsky says you should be gunning for a position in a software company - the environment where creatures of your kind naturally thrive. Conversely, McKenzie explains how to prosper as a programmer outside software companies - moving in the opposite direction of where things go by default (being stuck in a dark part of the office while they're trying to outsource your job.)

So the question is which path you prefer. "Not so fast", you say: one of these jobs is way easier to land - 80-90% of the chances are you're not getting inside a software company - so it's not just a question of preference.

Here I disagree: even if only 10-20% of programmers work in software companies (where are the stats?..), and even if they're "the best" (according to what metric?), McKenzie himself says in that same article:

You radically overestimate the average skill of the competition because of the crowd you hang around with:  Many people already successfully employed as senior engineers cannot actually implement FizzBuzz.

But if competition is relatively unskilled on average, you probably can land a job in the 10-20% of the sector that you want - as did most people who graduated around the time I did. So I rather firmly believe that it's a matter of choice: do you want to work on in-house software or one-off businessy projects of that kind, or do you prefer a software company?

Let's proceed to McKenzie's advice to in-house programmers - which should in itself help one make that choice.

How to call yourself

One such advice is:

Don't call yourself a programmer. “Programmer” sounds like “anomalously high-cost peon who types some mumbo-jumbo into some other mumbo-jumbo.” Instead, describe yourself by what you have accomplished for previous employers vis-a-vis increasing revenues or reducing costs.

Sure - an in-house programmer is likely doing some type of expensive mumbo-jumbo in the eyes of his non-technical MBA-wielding manager.

To me, however, a programmer is who I'm looking for, while a resume full of revenue increases and cost reductions sounds like an "anomalously high-cost parasite who types some mumbo-jumbo into Excel and PowerPoint, claiming credit for others' work".

McKenzie says a software company looks at this just like a company hiring internal programmers, essentially. His example is "the guy who wrote the backend billing code that 97% of Google’s revenue passes through - he’s now an angel investor". The guy apparently got rich by being near a "profit center" rather than through his unusual skills.

The thing is, in this case I believe he's talking about Ron Garret, the PhD from NASA's Jet Propulsion Laboratory. Do you think they hired him because he described his work at the JPL in terms of revenues and costs? (BTW he didn't like working on the billing code, bought his stock options and quit, instead of choosing a career at the company's biggest "profit center".)

Did any unusual skills go into the billing code? Ron Garret says:

I did end up writing the credit card billing and accounting system, which is a nontrivial thing to get right. Fortunately for me, just before coming to Google I had taken some time to study computer security and cryptography, so I was actually well prepared for that particular task. …I designed the billing system to be secure against even a dishonest employee with root access (which is not such an easy thing to do). I have no idea if they are still using my system, but if they are then I'd feel pretty confident that my credit card number was not going to get stolen.

Sounds to me that his technical knowledge and programming ability was the bulk of his contribution, whereas deep thoughts such as realizing that there will be some "cost reduction" due to not having credit card numbers stolen is not something an employer needs to hire anyone for.

So if I ever send out a resume as a chip architect, I will focus on my technical role in transitioning from fixed-function hardware accelerators to programmable processors, more than the manpower this saved and the business we won as a result (which I think were real outcomes of our work, but which is rather hard to quantify - as these things often are unless you're a business-friendly-sounding liar.)

Incidentally, I'm not sure when I'll send out that resume, which brings us to the next point.

On job hopping, backstabbing, and the lack thereof

Co-workers and bosses are not usually your friends: You will spend a lot of time with co-workers.  You may eventually become close friends with some of them, but in general, you will move on in three years…

<your boss will> attempt to do things that none of your actual friends would ever do, like try to talk you down several thousand dollars in salary or guilt-trip you into spending more time with the company when you could be spending time with your actual friends.  You will have other coworkers who — affably and ethically — will suggest things which go against your interests…

There is a certain internal consistency to a view that your coworkers are not your friends, because you will move on in 3 years. In fact, it's a bit circular. They aren't your friends - because you'll move on. And why will you move on? Well, I dunno, maybe for a 10% salary increase. What's there to lose? Relationships with coworkers? But coworkers aren't your friends!

Again, I don't disagree, but rather offer an alternative view, equally internally consistent. I have stayed at one job for more than a decade, in large part because I'm rather attached to the people I work with. To be sure, I got raises, and I was ready to quit over employment terms - but it'd take much more than 10%.

Isn't it just a quantitative difference in preferences - a 10% raise not being fundamentally different than, say, 100%? Well, sufficiently large quantitative changes add up to qualitative changes, as Marxian dialectics or some other Soviet philosophy thingie that my parents sometimes quote taught us. What's going on is that both approaches can lead to career advancement, but they do so very differently.

If you're willing to change jobs over a small raise, you'll be changing them frequently. You won't get attached to people, or to the work you're doing together. You will be very good at finding jobs and you will know what's generally going on in the industry and what's in demand. You will not know that many things specific to any of your employers. You and your employer will become very useful to each other fairly quickly, but you'll also be somewhat expendable for each other.

Alternatively, you can keep a job as long as it's a fun environment, requiring a significant raise once in a while. Your relationships with people combined with your long-term outlook can let you do things together that you otherwise couldn't plan or execute, and learn things you wouldn't have learned.

Much of my knowledge about chip design comes from ASIC hackers I worked with, and their willingness to develop their biggest ideas together with me came from trust that necessarily took time to build. It takes time to learn that none of you is in the habit of "suggesting things going against the other's interest", or pulling other unfriendly shenanigans.

Incidentally, if you stay at one place for a long while, then your worth to the employer grows to the point where you can get the significant raise that you'd quit over without actually quitting. Your worth can also grow well above what employers are willing to pay to experienced new hires, so there's no longer a point in switching jobs. This is somewhat analogous to becoming a consultant after having switched a whole lot of jobs and now making more than the next job hop could give you.

Both approaches work, though I don't have stats showing which tends to be more effective. I do believe that the long-term approach is more fun. I could never land the kind of gig that I have now through job hopping. More importantly, I wouldn't have the relationships that I have at work.

"More importantly", because all means to reach our ends often fail, and then all we're left with is our means. You can't count on any career strategy to give you either a dream job or a load of money; it'll work to some extent or other but who knows. What you can count on is your lifestyle being affected rather predictably by your career choices. The impact of these choices on relationships could thus be weighted as more important than the impact on career advancement because it's more predictable.

The part about bosses is the only one I very much agree with. (I had enough bosses to be able to plausibly deny that I'm thinking about any particular one here.) Yes, some of them will want you to work more time for less money (by itself a natural desire for an employer) while attempting to look like your friends (which is where it becomes a tad irksome). This just means that you should guard your own interests (as always) - and perhaps not judge people too harshly before spending time in their shoes.

How to value an equity grant

McKenzie says you shouldn't value equity very much, and he doesn't spend many of words to say it. I'll talk about stock options, which are worse than an actual equity grant and which is the only thing I've ever been offered.

My basic outlook is again long-term. I work at a private company whose value rose almost tenfold over the decade I've been there. And it's still a private company, so there's never been an easy venue to make money off most of the stock options.

From a long-term view, stock options look worse - and better.

Worse, because having stock options ties your hands behind your back. You usually can't afford to buy them when you quit, or at least buying them is a significant risk that you might be reluctant to take. If the company survives for a long while, then you may start to dislike the place but the hope of making money off your stock options now makes it harder to quit. If you generally like the place, options make it harder to negotiate a raise, since they know you can't quit.

So in the long term, options can effectively be a liability.

On the other hand, as the company matures, its stock options tend to get undervalued by employees, and for no good reason. People intuitively think along the lines of, "it's already expensive - how much can a price rise from now on?" It's a natural thought if the price has went up threefold or tenfold already.

But what this misses is that you don't get paid in percentage points - you get dollars. A $100 share going up 20% to $120 means you make $20 per share. A $5 share going up 100% to $15 means you only make $10 per share. Stock options of a mature company whose price is still rising can thus be even nicer than stock options of a young company which rises more quickly but which is still cheap - and is more likely to go bust overnight.

The upshot is that people overvalue stock options early on - but they also often undervalue them later on.

Note that if you don't intend to stay for more than 3 years, than stock options are most certainly a liability because they make it harder to quit - while the chances that the company makes it big in that span of time are very low.

Working at a startup

McKenzie lists valid reasons not to. In terms of job satisfaction, he says you can work on many exciting things in large corporations, not just startups.

Here's one thing in favor of startups. A large corporation usually doesn't have huge gaping holes that it doesn't know how to deal with or doesn't even notice. A startup often does have many such gaping holes, because, well, nothing is established yet, they don't even understand what they're doing, and most importantly, they are severely understaffed.

This means that you can grab pretty much any responsibility that you want to. There will be areas that people are competing to work on everywhere, but in a startup doing something hard enough, there will be a ton of hard problems nobody is competing to solve because there's not enough time or people for everything. You can be the person pointing out that problem and grabbing that responsibility.

As companies mature, being able to just work on whatever you want gets harder. My metaphor for it is nomadic programmers moving from problem to problem vs settlers with states and national borders where even visiting your neighbor's code may involve a visa.

This isn't a recommendation to work for startups, just one thing worth pointing out. The counterpoint is that if you're an orderly person who wants an orderly process, then a larger company known for its development culture is probably a better idea.

Impact of career on life happiness

At the end of the day, your life happiness will not be dominated by your career.

In one way, I agree wholeheartedly; whatever the merits of a job, it's a job, and I actually noticed my productivity fall at times of treating it as more important than that. The healthy way of looking at it is "just a job, at the end of the day".

On the other hand, we do spend quite some time at work. The question is, to what extent does it make sense to separate "work" from "life" - and to what extent it's one part of life among many, to be treated similarly to those other parts of life?

I argue that the "work/life" separation shouldn't be strong enough to separate "coworkers" into a distinct category of human beings with whom relationships are formed fundamentally differently - nor is it necessarily great to be emotionally detached from the workplace to be always ready to abandon it and "move on".

(I'm not arguing that McKenzie's intent was to say the exact opposite of what I'm saying, BTW. I'm just commenting on some quotes and the general atmosphere of the text as I perceived it. A lot of things simply have different meaning when heard by different people; a simple advice like "be wary of others' intentions" is great for someone overly trusting, but not for someone already verging on paranoia. Some people need to hear that coworkers aren't friends; today I'm writing for the other people.)

Summary

When I introduce myself, I usually call myself a programmer, regardless of my current work on chip architecture and management and stuff. I got into programming for the money, so it's not like I'm overflowing with pride when uttering "programmer". I just think programming is a great career and the right thing to call myself for me.

There's an alternative approach where you program, but you don't call it that, and you use programming as a starting point from which you transition to some form of being involved in business as directly as possible.

It sounds a bit roundabout to me - why not just get an MBA instead? - but maybe it's the right path for some (especially considering that some prestigious MBA programs want you to have industry experience before you can even enroll.)

The important thing is to choose the path that suits your preferences, follow it consistently, and realize where your approach is most likely to succeed. Because where I work, someone applying for a programming position and not calling himself a programmer will not make a good impression.

I agree emphatically with many of the points in McKenzie's article - my favorite point is the importance of communication skills - and I very much recommend it.

How FPGAs work, and why you'll buy one

Update (June 21): this article has been published at embeddedrelated.com, where I hope to publish a follow-up soon.

Today, pretty much everyone has a CPU, a DSP and a GPU, buried somewhere in their PC, phone, car, etc. Most don't know or care that they bought any of these, but they did.

Will everyone, at some future point, also buy an FPGA? The market size of FPGAs today is about 1% of the annual global semiconductor sales (~$3B vs ~$300B). Will FPGA eventually become a must-have, or will its volume remain relatively low?

We'll try to answer this question below. In order to see how popular FPGAs could become, we'll need to discuss what FPGAs are. FPGAs are a programmable platform, but one designed by EEs for EEs rather than for programmers. So for many programmers, FPGAs are exciting yet mysterious; I hope our discussion will help demystify them.

We'll start with a common explanation of FPGAs' relatively low popularity. We'll see why that explanation is wrong - and why, if we take a closer look, we actually come to expect FPGAs to blow the competition out of the water!

This will conclude today's installment, "Why you'll buy an FPGA". A sequel is in the making, titled "Why you won't buy an FPGA". There, we'll see some of the major obstacles standing between FPGAs and world domination.

The oft-repeated wrong answer

…to the question of "why aren't FPGAs more popular?" is, "FPGA is a poor man's alternative to making chips. You can implement any circuit design in an FPGA, but less efficiently than you could in an ASIC or a custom design. So it's great for prototyping, and for low-volume products where you can't afford to make your own chips. But it makes no sense for the highest-volume devices - which happen to add up to 99% of sales, leaving 1% to FPGAs."

This is wrong because programmability is a feature, not just a tax on efficiency.

Of course a Verilog program doing convolution on an FPGA would run faster if you made a chip that runs just that program. But you typically don't want to do this, even for the highest-volume products, any more than you want to convert your C programs running on CPUs into dedicated hardware! Because you want to change your code, run other programs, etc. etc.

When programmability is required - which is extremely often - then the right thing to compare FPGAs to is another programmable platform: a DSP, a GPU, etc. And, just like FPGAs, all of these necessarily introduce some overhead for programmability. So we can no longer assume, a priori, that any one option is more efficient than another - as we did when comparing FPGAs to single-purpose ASICs.

We need benchmarks - and FPGAs' performance appears very competitive in some benchmarks. Here's what BDTI's report from 2007 says:

…we estimated that high-end FPGAs implementing demanding DSP applications … consume on the order of 10 watts, while high-end DSPs consume roughly 2-3 watts. Our benchmark results have shown that high-end FPGAs can support roughly 10 to 100 times more channels on this benchmark than high-end DSPs…

So for that benchmark, FPGAs offer 10x-100x the runtime performance, and 2x-30x the energy efficiency of DSPs - quite impressive!

But wait - how are they so efficient?

FPGAs are no longer FPGAs

Aren't FPGAs Field-Programmable Gate Arrays?

Programmable gate arrays can't multiply as efficiently as dedicated multipliers, can they? A dedicated multiplier is a bunch of gates connected with wires - the specific gates that you need for multiplying, connected specifically to the right other gates as required for multiplication.

A programmable gate array is when your gates are generic. They index into a truth table (called a look-up table or LUT) with their inputs, and fetch the answer. With a 2-input LUT, you get an OR gate or an AND gate or whatever, depending on the truth table you programmed. With 3-input LUTs, you can have a single gate computing, say, (a&b)|c, but the principle is the same:

This absolutely must be bigger and slower than just an OR gate or an AND gate!

Likewise, wires go through programmable switch boxes, which connect wires as instructed by programmable bits:

There are several switch box topologies determining which wires can be connected to which. But whatever the topology, this must be bigger and slower than wires going directly to the right gates.

All this is indeed true, and a "bare" FPGA having nothing but programmable gates and routers cannot compete with a DSP. However, today's FPGAs come with DSP slices - specialized hardware blocks placed amidst the gates and routers, which do things like multiply-accumulate in "hard", dedicated gates.

So that's how FPGAs compete with DSPs - they have DSP hardware in them! Cheating, isn't it?

Well, yes and no.

It's "cheating" in the sense that FPGAs aren't really FPGAs any more - instead, they're arrays of programmable gates plus all that other stuff. A "true FPGA" would look like this:

Instead, a high-end modern FPGA looks like this:

To be competitive in DSP applications, FPGAs need DSP slices - ALUs doing things like multiply-accumulates.

To be competitive in applications needing a CPU - which is most of them - today's FPGAs have more than just specialized ALUs. They have full-blown ARM cores implemented using "hard", non-programmable gates!

So you've been "cheated" if you thought of FPGAs as "clean slates" suitable for any design. In reality, FPGAs have specialized hardware to make them competitive in specific areas.

And you can sometimes guess where they're less competitive by observing which specializations they lack. For instance, there are no "GPU slices", and indeed I don't believe FPGAs can compete with GPUs in their own domain as they compete with DSPs. (Why not simply add GPU slices then? Here the plot thickens, as we'll see in the follow-up article.)

But of course having DSP slices is more than just "cheating" - because look at just how many DSP slices FPGAs have. The cheapest FPGAs can do hundreds of mutliply-accumulates simultaneously! (My drawing above has the wrong scale - imagine hundreds of small DSP slices near a couple of much larger CPUs.)

And hundreds of MACs is a big deal, because while anyone can cram a load of multipliers into a chip, the hard part is to connect it all together, letting a meaningful program actually use these multipliers in parallel.

For instance, TI's C64 DSPs can do 8 MACs per cycle - but only if it's a dot product. TI's C66 DSPs can do 32 MACs/cycle - but only if you're multiplying complex numbers. You only get the highest throughput for very specific data flows.

To the extent that the FPGA architecture lets you actually use an order of magnitude more resources at a time, and do that in more real-life examples, it is a rather unique achievement. And this is how they actually beat dedicated DSPs with their DSP slices, not just reach the same performance.

FPGA as a programmable accelerator architecture

So what makes FPGAs such an efficient architecture? There's no simple answer, but here are some things that FPGAs can use to their advantage:

  • No need for full-blown ALUs for simple operations: a 2-bit adder doesn't need to be mapped to a large, "hard" DSP slice - it can fit comfortably in a small piece of "soft" logic. With most processors, you'd "burn" a full-blown ALU to do the simplest thing.
  • No need for a full cycle for simple operations: on FPGAs, you don't have to sacrifice a full cycle to do a simple operation, like an OR, which has a delay much shorter than a full cycle. Instead, you can feed OR's output immediately to the next operation, say, AND, without going through registers. You can chain quite a few of these, as long as their delays add up to less than a cycle. With most processors, you'd end up "burning" a full cycle on each of these operations.
  • Distributed operand routing: most processors have their ALUs communicate through register files. With all the ALUs connected to all the registers, there's a bottleneck - this interconnect grows as the product of the number of ALUs and registers, so you can't have too many of either. FPGAs spread ALUs and registers throughout the chip, and you can connect them in ways not creating such bottlenecks - say, as a long chain, as a tree, and in many other ways. Of course you can also route everything through a bottleneck, and then your design will run at a low frequency - but you don't have to. With CPUs or DSPs, they run at a high frequency - because the amount of ALUs and registers was limited to make that frequency possible. But in FPGAs you can get both high frequencies and a lot of resources used in parallel.
  • Distributed command dispatching: a 2-issue or a 6-issue processor is common, but 100-issue processors are virtually unheard of. Partly it's because of the above-mentioned operand routing, and partly it's because of command dispatching - you'd have to fetch all those commands from memory, another bottleneck. In FPGAs, you can implement command-generating logic in simple state machines residing near your ALUs - and in the simplest case, commands are constants kept in registers residing near ALUs. This lets you easily issue 100 parallel instructions.

This "distributed" business is easier to appreciate by looking at an example. Here's a schematic implementation of a 1D convolution on an FPGA - you convolve a long vector v with an N-coefficient filter f, computing, at every i, f0*v[i] + f1*v[i-1] + f2*v[i-2] + … + fN-1*v[i-N-1]:

In this drawing, N=8, but it scales easily to arbitrary N, producing results at a slightly larger latency - the summation tree depth being log(N).

The orange boxes are registers; commands like + and * are stored in registers, as are inputs and outputs. (I'm feeding the output of * to + directly without going through a register to save screen space.) Every clock cycle, inputs are fed to ALUs, and the outputs become the new register values.

Orange boxes (registers) spread amongst green boxes (ALUs) illustrate "distributed operand and command routing". If you wonder how it all looks like in code, Verilog source code corresponding to this drawing appears near the end of the article.

And here's a linear pipeline without a summation tree:

This is a little trickier, at least to me (I had a bug in my first drawing, hopefully it's fixed). The idea is, every pair of ALUs computes a product of fk with v[i-k], adds it to the partial sum accumulated thus far, and sends the updated partial sum downstream to the next pair of ALUs.

The trick is this. The elements of v are also moving downstream, together with the sums. But after v[i] got multiplied by f0, you don't want to multiply it by f1 in the next cycle. Instead, you want to multiply v[i-1] by f1 - that's the product that we need for the convolution at index i. And then you do want to multiply v[i] by f1 once cycle later - for the convolution at index i+1. I hope that my sampling of v[i] to an intermediate register, which delays its downstream motion, does the trick.

So these two examples show how FPGA programming is different from programming most kinds of processors - and how it can be more efficient. More efficient, because you can use a lot of ALUs simultaneously with little overhead spent on dispatching commands and moving inputs and outputs between ALUs. An argument can be made that:

  • FPGAs are more flexible than SIMD/SIMT. You can give different instructions to different ALUs, and you can route operands from different places. Contrast this with SIMD instructions like add_16_bytes, with byte i always coming from offset i inside a wide register.
  • FPGAs scale better than VLIW/superscalar. More instructions can be issued simultaneously, because there's no routing bottleneck near the register file, and no instruction memory bandwidth bottleneck.
  • FPGAs are more efficient than multiple cores. Multiple cores are flexible and can scale well. But you pay much more overhead per ALU. Each core would come with its own register files and memories, and then there are communication overheads.

This gives us a new perspective on LUTs and switch boxes. Yes, they can be an inefficient, cheaper-to-manufacture alternative to dedicated gates and wires. But they are also a mechanism for utilizing the "hard" components spread in between them - sometimes better than any other mechanism.

And this is how FPGAs beating DSPs with the help of DSP slices isn't "cheating". (In fact, mature DSPs "cheat" much more by having ugly, specialized instructions. Far more specialized than FPGAs' multiply-accumulate, dot product instructions being among the least ugly. And the reason they need such instructions is they don't have the flexibility of FPGAs, so what FPGAs effectively do in software, they must do in hardware in order to optimize very specific data flows.)

I/O applications

But wait - there's more! In addition to being a hardware prototyping platform and an accelerator architecture, FPGAs are also uniquely suited for software-defined I/O.

"Software-defined I/O" is the opposite of "hardware-defined I/O" - the common state of things, where you have, for instance, an Ethernet controller implementing some share of TCP or UDP in hardware. Software-defined I/O is when you have some programmable hardware instead of dedicated hardware, and you implement the protocols in software.

What makes FPGAs good at software-defined I/O?

  • Timing control: Verilog and other hardware description languages give you more precise control over timing than perhaps any other language. If you program it to take 4 cycles, it takes 4 cycles - no cache misses or interrupts or whatever will get in your way unexpectedly. And you can do a whole lot in these 4 cycles - FPGAs are good at issuing plenty of instructions in parallel as we've seen. This means you don't have to account for runtime variability by buffering incoming data, etc. - you know that every 4 cycles, you get a new byte/pixel/etc., and in 4 cycles, you're done with it. This is particularly valuable where "deep" buffering is unacceptable because the latency it introduces is intolerable - say, in a DRAM controller. You can also do things like generating a clock signal at a desired frequency, or deal with incoming clock signal at a different frequency than yours.
  • Fine-grained resource allocation: you "burn" a share of FPGA resources to handle some peripheral device - and that's what you've spent. With other processor cores, you'll burn an entire core - "this DSP handles WiFi" - even if the core is idle much of the time. (The FPGA resources are also burnt that way - but you'll often spend less resources than a full processor core takes.) Alternatively, you can time-share that DSP core - but it's often gnarly. Many kinds of cores expose a lot of resources that must be manually context-switched at an intolerably high latency. Core asymmetry gets in the way of thread migration. And with two I/O tasks, often none can tolerate being suspended for a long while, so you'll definitely burn two cores. (One solution is hardware multithreading.)

The upshot is that relatively few processors other than FPGAs are suitable for software-defined I/O. The heavily multi-threaded XMOS is claimed to be one exception. Then there are communication processors such as the hardware-threaded Qualcomm Hexagon DSP and the CEVA-XC DSPs. But these are fairly specialized; you couldn't use them to implement a memory controller or an LVDS-to-parallel video bridge, both of which you could do with an FPGA.

And of course, FPGA's I/O capabilities can be combined with computation acceleration - get pixels and enhance the image color on the fly, get IP packets with stock info and decide which stocks to trade on the fly.

Programmable, efficient, and versatile, FPGAs are starting to sound like a great delivery platform.

Summary

There are several points that I tried to make. Some are well-known truisms, and others are my own way of looking at things, which others might find debatable or at least unusually put.

  • While FPGA are a great small-scale circuit delivery platform, they can also be a large-scale software delivery platform. You can think of FPGAs as "inefficiently simulating circuits". But in other contexts, you can also think of them as "efficiently executing programs"!
  • Instead of fixed-function gates and wires connecting specific gates to each other, FPGAs use programmable gates - configured by setting a truth table of choice - and programmable switch boxes, where incoming wires are connected to some of the other wires based on configuration bits. By itself, it's very inefficient compared to a "direct" implementation of a circuit.
  • Then how can FPGAs beat, not just CPUs, but specialized accelerators like DSPs in their own game? The trick is, they're no longer FPGAs - gate arrays. Instead, they're also arrays of RAMs and DSP slices. And then they have full-blown CPUs, Ethernet controllers, etc. implemented in fixed-function hardware, just like any other chip.
  • In such modern FPGAs, the sea of LUTs and switch boxes can be used not instead of fixed-function circuits, but as a force multiplier letting you make full use of your fixed-function circuits. LUTs and switch boxes give two things no other processor architecture has. First, the ability to use less than a full-blown ALU for simple things - and less than a full clock cycle. Second, distributed routing of commands and operands - arguably more flexible than SIMD, more scalable than superscalar execution, and more efficient than multiple instruction streams.
  • FPGAs are the ultimate platform for software-defined I/O because of their timing control (if I said 4 cycles, it takes 4 cycles) and fine-grained resource allocation (spend so many registers and ALUs per asynchronous task instead of dedicating a full core or having to time-share it).

With all these advantages, why just 1% of the global semiconductor sales? One reasonable answer is that it took FPGAs a long time to evolve into their current state. Things FPGAs have today that they didn't have in the past include:

  • Fixed-function hardware essential for performance - this gradually progressed from RAM to DSP slices to complete CPUs.
  • Quick runtime reconfiguration, so that you can run convolution and then replace it with FFT - which you can't, and shouldn't be able to do, if you're thinking of FPGA as simulating one circuit.
  • Practically useable C-to-Verilog compilers, letting programmers, at least reasonably hardcore ones, who nonetheless aren't circuit designers, to approach FPGA programming easily enough.

All of these things cater to programmers as much or more than they cater to circuit designers. This shows that FPGAs are gunning for a position in the large-scale software delivery market, outside their traditional small-scale circuit implementation niche. (Marketing material by FPGA vendors confirms their intentions more directly.)

So from this angle, FPGAs evolved from a circuit implementation platform into a software delivery platform. Being a strong programmable architecture, they're expected to rise greatly in popularity, and, like other programmable architectures, end up everywhere.

Unanswered questions

As a teaser for the sequel, I'll conclude with some questions which our discussion left unanswered.

Why do FPGAs have DSP slices and full-blown "hard" CPUs? Why not the other way around - full-blown DSP cores, and some sort of smaller "CPU slices"? Where are the GPU slices? And if rationing individual gates, flip-flops and picoseconds instead of full ALUs, registers and clock cycles is so great, why doesn't everyone else do it? Why do they all break up resources into those larger chunks and only give software control over that?

Stay tuned for the sequel - "How FPGAs work, and why you won't buy one".

P.S. Programmable - how?

So how do you program the programmable gate array? Talk is cheap, and so are Microsoft Paint drawings. Show me the code!

The native programming interface is a hardware description language like Verilog. Here's an implementation of the tree-like convolution pipeline in Verilog - first the drawing and then the code:

module conv8(clk, in_v, out_conv);
  //inputs & outputs:
  input clk; //clock
  input [7:0] in_v; //1 8-bit vector element
  output reg [18:0] out_conv; //1 19-bit result

  //internal state:
  reg [7:0] f[0:7]; //8 8-bit coefficients
  reg [7:0] v[0:7]; //8 8-bit vector elements
  reg [15:0] prod[0:7]; //8 16-bit products
  reg [16:0] sum0[0:3]; //4 17-bit level 0 sums
  reg [17:0] sum1[0:1]; //2 18-bit level 1 sums

  integer i; //index for loops unrolled at compile time

  always @(posedge clk) begin //when clk goes from 0 to 1
    v[0] <= in_v;
    for(i=1; i<8; i=i+1)
      v[i] <= v[i-1];
    for(i=0; i<8; i=i+1)
      prod[i] <= f[i] * v[i];
    for(i=0; i<4; i=i+1)
      sum0[i] <= prod[i*2] + prod[i*2+1];
    for(i=0; i<2; i=i+1)
      sum1[i] <= sum0[i*2] + sum0[i*2+1];
    out_conv <= sum1[0] + sum1[1];
  end
endmodule

This example shows how "distributed routing" actually looks in code - and the fine-grained control over resources, defining things like 17-bit registers.

And it's fairly readable, isn't it? Definitely prettier than a SIMD program spelled with intrinsics - and more portable (you can target FPGAs by different vendors as well as an ASIC implementation using the same source code; it's not trivial, but not hopeless unlike with SIMD intrinsics, and probably not harder than writing actually portable OpenCL kernels.)

Incidentally, Verilog is perhaps the quintessential object-oriented language - everything is an object, as in a physical object: a register, a wire, a gate, or a collection of simpler objects. A module is like a class, except you can't create objects (called instantiations) dynamically - all objects are known at compile time and mapped to physical resources.

Verilog insists on encapsulation as strictly as it possibly could: there's simply no way to set an object's internal state. Because how could you affect that state, physically, without a wire going in? Actually, there is a way to do that - the usual instance.member syntax; hardware hackers call this "an antenna", because it's "wireless" communication with the object's innards. But it doesn't synthesize - that is, you can do it in a simulation, but not in an actual circuit.

Which means that our example module is busted, because we can't initialize the filter coefficients, f. In simulations, we can use antennas. But on an FPGA, we'd need to add, say, an init_f input, and then when it's set to 1, we could read the coefficients from the same port we normally use to read v's elements. (BTW, not that it adds much efficiency here, but the "if" test below is an example of an operation taking less than a cycle.)

always @(posedge clk) begin
  if(init_f) begin
    f[0] <= in_v;
    for(i=1; i<8; i=i+1)
      f[i] <= f[i-1];
  end
end

A triumph of encapsulation, it's also a bit of a pity, because there are now actual wires and some control logic sitting near our coefficient registers, enlarging the circuit, only to be used upon initialization. We're used to class constructors "burning" a few memory bits; who cares - the bits are quickly swapped out from the instruction cache, so you haven't wasted resources of your computational core. But Verilog module initialization "burns" LUTs and wires, and it's not nearly as easy to reuse them for something else. We'll elaborate on this point in the upcoming sequel.

Not only is Verilog object-oriented, but it's also the quintessential language for event-driven programming: things are either entirely static (these here bits go into this OR gate), or triggered by events (changes of signals, very commonly a clock signal which oscillates between 0 and 1 at some frequency). "always @(event-list)" is how you say what events should cause your statements to execute.

Finally, Verilog is a parallel language. The "static" processes, like bits going into OR gates, as well as "event-driven processes", like statements executing when the clock goes from 0 to 1, all happen in parallel. Within a list of statements, "A <= B; C <= A;" are non-blocking assignments. They happen in parallel, so that A is assigned the value of B, and C is simultaneously assigned the (old) value of A.

So, for example, prod[i]<=f[i]*v[i] sets the new value of prod, and in parallel, sums are computed from the old values of prod, making it a pipeline and not a serial computation. (Alternatively, we could use blocking assignments, "=" instead of "<=", to do it all serially. But then it would take more time to execute our series of statements, lowering our frequency, as clk couldn't switch from 0 to 1 again until the whole serial thing completes. Synthesis tools tell you the maximal frequency of your design when they're done compiling it.)

On top of its object-oriented, event-based, parallel core, Verilog delivers a ton of sweet, sweet syntactic sugar. You can write + and * instead of having to instantiate modules with "adder myadd(a,b)" or "multiplier mymul(a,b)" - though + and * are ultimately compiled down to module instances (on FPGAs, these are often DSP slice instances). You can use if statements and array indexing operators instead of instantiating multiplexors. And you can write loops to be unrolled by the compiler, generate instantiations using loop syntax, parameterize your designs so that constants can be configured by whoever instantiates them, etc. etc.

If all this doesn't excite you and you'd rather program in C, you can, sort of. There's been loads of "high-level synthesis tools" - basically C to Verilog compilers - and their quality increased over the years.

You'd be using a weird C dialect - no function pointers or recursion, extensions to specify the exact number of bits in your integers, etc. You'd have to use various #pragmas to guide the compilation process. And you'd have things like array[index++] not actually working with a memory array - and index++ not actually doing anything - because you're getting values, not from memory, but from a FIFO, or directly from the output of another module (just like in_v in our Verilog code doesn't have to come from memory, and out_conv doesn't have to go to memory.)

But you can use C, sort of - or Verilog, for real. Either way, you can write fairly readable FPGA programs.

The bright side of dark silicon

It's been a decade or so since the end of frequency scaling, and multicore has become ubiquitous, there being no other means to increase a chip's performance.

Some multicore systems are symmetric - all cores are identical, so you can easily move work from one core to another. Others are asymmetric - as in CPU cores and GPU cores, where it's harder to move work between different types of cores.

Which is better - symmetric or asymmetric multicore?

Why symmetric is better

Three main reasons that I see:

  • Better load balancing
  • Less work for everyone
  • More redundancy

Better load balancing

Asymmetric multicore makes load balancing harder, because a GPU can't easily yank a job from a queue shared with a CPU and run that job. That's because some of those jobs are simply impossible to run on a GPU. Others run so badly that it's not worth the trouble.

And those CPU codes that could run OK on GPUs would have to be compiled twice - for the CPU and the GPU - and even then you can't make things like function pointers and vtables work (though I can imagine a hardware workaround for the latter - a translation table of sorts; maybe I should patent it. Anyway, we're very far from that being our biggest problem.)

And then you need a shared queue between the CPU and the GPU - how does that work? - or you partition the work statically (each of the 4 CPUs processes 10% of the pixels, the remaining 60% of the pixels go to the GPU cores).

But static partitioning, often quite lousy even with symmetric multicore, is awful with asymmetric multicore because how do you choose the percentages? You need to know the relative strength of the cores at each task. How do you do that - dynamically figure out the first time your program runs on a new device?

So this is all close to insane. What people actually do instead is task parallelism - they look at their different jobs, and they figure out which should run on each type of core, and optimize each task for the respective core.

But task parallelism never load-balances very well. Let's say you look for faces in an image on the GPU and then try to figure out whose faces these are on the CPUs. Then sometimes the GPU finds a lot of faces and sometimes just a few, taking roughly the same time to do so. But the CPU then has either a lot of work or just a little. So one of them will tend to be the bottleneck.

Less work for everyone

We actually touched on that above. If you wanted to do data parallelism, running the same task on all your cores but on different subsets of the data, one problem would be to optimize your code for each type of core. That's more work. Someone at the OS/system level would also need to help you with sharing task queues and vtables - still more work.

Generally, more types of core means more hardware design, more compilers, assemblers, linkers and debuggers, more manuals, and more integration work from bus protocols to program loaders, etc. etc. And, for programmers, not only more optimization work but more portability problems.

More redundancy

That's a bit futuristic, but I actually heard this argument from respectable people. The idea is, chip manufacturing yields will significantly drop at, say, 8nm processes. And then your chance to get a chip without a microscopic defect somewhere will become so low that throwing away every defective chip will be uneconomical.

Well, with symmetric multicore you don't have to throw away the chip. If the testing equipment identifies the core that is no longer useable and marks the chip accordingly using fuses or some such (which is easy to do), an OS can then run jobs on all cores but the bad one.

Nifty, isn't it?

With asymmetric multicore, you can't do that, because some type of work will have no core on which it can run.

Why asymmetric is inevitable

In two words - dark silicon.

"Dark silicon" is a buzzword used to describe the growing gap between how many transistors you can cram into a chip with each advancement in lithography vs how many transistors you can actually use simultaneously given your power budget - the gap between area gains and power gains.

It's been a couple of years since the "dark silicon" paper which predicted "the end of multicore scaling" - a sad follow-up to the end of frequency scaling.

The idea is, you can have 2x more cores with each lithography shrink, but your energy efficiency grows only by a square root of 2. So 4 shrinks mean 16x more cores - but within a fixed power budget, you can only actually use 4. So progress slows down, so to speak. These numbers aren't very precise - you have to know your specific process to make a budget for your chip - but they're actually not bad as a tool to think about this.

With 16x more area but just 4x more power, can anything be done to avoid having that other 4x untapped?

It appears that the only route is specialization - spend a large fraction of the area on specialized cores which are much faster at some useful tasks than the other cores you have.

Can you then use them all in parallel? No - symmetric or asymmetric, keeping all cores busy is outside your power budget.

But, if much of the runtime is spent running code on specialized cores doing the job N times faster than the next best core, then you'll have regained much of your 4x - or even gained more than 4x.

Gaining more than 4x has always been possible with specialized cores, of course; dark silicon is just a compelling reason to do it, because it robs you of the much easier alternative.

What about load balancing? Oh, aren't we "lucky"! It's OK that things don't load-balance very well on these asymmetric systems - because if they did, all cores would be busy all the time. And we can't afford that - we must keep some of the silicon "dark" (not working) anyway!

And what about redundancy? I dunno - if the yield problem materializes, the increasingly asymmetric designs of today are in trouble. Or are they? If you have 4 CPUs and 4 GPU clusters, you lose 25% of the performance, worse than if you had 12 CPUs; but the asymmetric system outperforms the symmetric one by more than 25%, or so we hope.

So the bright side of dark silicon is that it forces us to develop new core architectures - because to fully reap the benefits of lithography shrinks, we can't just cram more of the same cores into a same-sized chip. Which, BTW, has been getting boring, boring, boring for a long time. CPU architecture has stabilized to a rather great extent; accelerator architecture, not nearly so.

GPUs are the tip of the iceberg, really - the most widely known and easily accessible accelerator, but there are loads of them coming in endless shapes and colors. And as time goes by and as long as transistors keep shrinking but their power efficiency lags behind, we'll need more and more kinds of accelerators.

(I have a lot of fun working on accelerator architecture, in part due to the above-mentioned factors, and I can't help wondering why it appears to be a rather marginal part of "computer architecture" which largely focuses on CPUs; I think it has to do with CPUs being a much better topic for quantitative research, but that's a subject for a separate discussion.)

And this is why the CPU will likely occupy an increasingly small share of the chip area, continuing the trend that you can see in chip photos from ChipWorks et al.

P.S.

I work on switching-limited chip designs: most of the energy is spent on switching transistors. So you don't have to power down the cores between tasks - you can keep them in an idle state and they'll consume almost no energy, because there's no switching - zeros stay zeros, and ones stay ones.

Chips which run at higher frequencies and which are not designed to operate at high temperatures (where high leakage would become intolerably high - leakage grows non-linearly with temperature) are often leakage-limited. This means that you must actually power down a core or else it keeps using much of the energy it uses when doing work.

Sometimes powering down is natural, as in standby mode. Powering down midway through realtime processing is harder though, because it takes time to power things down and then to power them back up and reinitialize their pesky little bits such as cache line tags, etc.

So in a leakage-limited design, asymmetric multicore is at some point no better than symmetric multicore - if the gaps between your tasks are sufficiently short, you can't power down anything, and then your silicon is never dark, so either you make smaller chips or programs burn them.

But powering up and down isn't that slow, so a lot of workloads should be far from this sad point.

P.P.S.

I know about GreenDroid, a project by people who make the "dark silicon leads to specialization" argument quite eloquently; I don't think their specialization is the right kind - I think cores should be programmable - but that again is a subject for a separate discussion.

P.P.P.S.

Of course there's one thing you can always do with extra area which is conceptually much easier than adding new types of cores - namely, add more memory, typically L2/L3 cache. Memory is a perfect fit for the dark silicon age, because it essentially is dark silicon - its switching energy consumption is roughly proportionate to the number of bytes you access per cycle but is largely independent of the number of bytes you keep in there. And as to leakage, it's easier to minimize for memories than most other kinds of things.

Another "lucky" coincidence is that you really need caches these days because external DRAM response latency has been 100 ns for a long time while processor clocks tend to 50-200x shorter, so missing all the caches really hurts.

So it's natural to expect memories to grow first and then the accelerator zoo; again consistently with recent chip photos where, say, ARM's caches are considerably bigger the ARM cores themselves.

(Itanium famously spent 85% percent of the chip area or so on caches, but that was more of "cheating" - a way to show off performance relative to x86 when in fact the advantage wasn't there - than anything else; at least that's how Bob Colwell quoted his conversation with Andy Grove. These days however it has become one of the few ways to actually use the extra area.)