Entries Tagged 'software' ↓

Why programming isn’t for everyone

Today I learned about HyperCard, a system where you could implement a basic calculator in a few easy steps, one of them involving the following impressively English-like snippet:

on mouseUp
  get name of me
  put the value of the last word of it after card field "lcd"
end mouseUp

The article depicts HyperCard as a system making programming accessible to people who aren’t professional developers. It is claimed that Apple likely killed off the product because it’s inconsistent with its business model (roughly, devices bought to consume rather than to create).

I sympathize with the sentiment - I very much like stuff you can tinker with, and dislike business models discouraging tinkering. However, I don’t think businesses have the power to prevent anything that works well for many people from happening. A conspiracy of typewriter manufacturers could never stop the PC.

This seems especially true with software, where huge systems can be built by volunteers in their spare time. If an idea works, if a software system wants to be built around it, it will be built.

Of course it may be the case that the time hasn’t come for a programming system for non-developers. It’s just my opinion that it never will come, not really. Why?

Not because you need much education to program. Very useful stuff can be built without knowing why optimal sorting is O(n*log(n)), or even what big O means.

Not because programming languages must have, or typically have arcane syntax. As a kid, I found Pascal’s somewhat English-like “begin” and “end” off-putting, and was greatly relieved to discover Algolish braces. How close to natural language syntax can get, and whether it is at all beneficial to go there is IMO an irrelevant question. The fact is that programming languages can be very readable to people.

The main reason is that development leads to maintenance, and maintenance leads to suffering.

For example, if your program stores persistent data, and you want to change it, your changes to the program must be done such as to preserve the meaning of existing data. This part of development causes major pain everywhere, from video recording to financial databases to compiler construction. No amount of knowledge and no amount of support from the tools make this fun.

There are many other things. Everything in your program’s environment is unstable and you must constantly update the program to keep up. Your program gets cluttered with options and you forget what does what. There are cases you didn’t test - spaces in the names, empty data fields, reverse order of operations.

As a result, maintenance means dealing with misbehaving programs that eat data, send trash around, or simply make you wait for an hour and then watch them produce garbage.

This never ends and quickly stops being fun. When something useful can not be done quickly and isn’t the average person’s idea of fun, it becomes the business of professionals - or hardcore hobbyists indistinguishable from professionals. As a counter-example, many people like cooking in their spare time without necessarily getting close to the level of a chef or spending that much time cooking. Con Kolivas, on the other hand, could technically be called a “hobbyist”, but he could be called a “professional” as well.

Maybe I’m wrong, maybe there are plenty of places where a sprinkle of logic - in textual form or graphical form or whatever form - can be figured out quickly, left alone and be useful ever after. It’s just that it’s usually the opposite with me. Every time I have a nice little idea it takes me 10x the time it “should” take to implement, and most things keep biting me once in a while for a long time.

Programming isn’t for everyone because it is not fun to maintain what was fun to program.

Cycles, memory, fuel and parking

In high-performance, resource-constrained projects, you’re not likely to suddenly run out of cycles - but you’re quite likely to suddenly run out of memory. I think it’s a bit similar to how it’s easy to buy fuel for your car - but sometimes you just can’t find a parking spot.

I think the difference comes from pricing. Processor cycles are priced similarly to fuel, whereas memory bytes are priced similarly to parking spots. I think I know the problem but not the solution - and will be glad to hear suggestions for a solution.

Cycles: gradual price adjustment

If you work on a processing-intensive project - rendering, simulation, machine learning - then, roughly, every time someone adds a feature, the program becomes a bit slower. Every slowdown makes the program a bit “worse” - marginally less useable and more annoying.

What this means is that every slowdown is frowned upon, and the slower the program becomes, the more a new slowdown is frowned upon. From “it got slower” to “it’s annoyingly slow” to “we don’t want to ship it this way” to “listen, we just can’t ship it this way” - effectively, a developer slowing things down pays an increasingly high price. Not money, but a real price nonetheless - organizational pressure to optimize is proportionate to the amount of cycles spent.

Therefore, you can’t “suddenly” run out of cycles - long before you really can’t ship the program, there will be a growing pressure to optimize.

This is a bit similar to fuel prices - we can’t “suddenly” run out of fuel. Rather, fuel prices will rise long before there’ll actually be no fossil fuels left to dig out of the ground. (I’m not saying prices will rise “early enough to readjust”, whatever “enough” means and whatever the options to “readjust” are -  just that prices will rise much earlier in absolute terms, at least 5-10 years earlier).

This also means that there can be no fuel shortages. When prices rise, less is purchased, but there’s always (expensive) fuel waiting for those willing to pay the price. Similarly, when cycles become scarce, everyone spends more effort optimizing (pays a higher price), and some features become too costly to add (less is purchased) - but when you really need cycles, you can get them.

Memory: price jumps from zero to infinity

When there’s enough memory, the cost of an allocated byte is zero. Nobody notices the memory footprint - roughly, RAM truly is RAM, the cost of memory access is the same no matter where objects are located and how much memory they occupy together. So who cares?

However, there comes a moment where the process won’t fit into RAM anymore. If there’s no swap space (embedded devices), the cost of allocated byte immediately jumps to infinity - the program won’t run. Even if swapping is supported, once your working set doesn’t fit into memory, things get very slow. So going past that limit is very costly - whereas getting near it costs nothing.

Since nobody cares about memory before you hit some arbitrary limit, this moment can be very sudden: without warning, suddenly you can’t allocate anything.

This is a bit similar to a parking lot, where the first vehicle is as cheap to park as the next and the last - and then you can’t park at all. Actually, it’s even worse - memory is more similar to an unmarked parking lot, where people park any way they like, leaving much unused space. Then when a new car arrives, it can’t be parked unless every other car is moved - but the drivers are not there.

(Actually, an unmarked parking lot is analogous to fragmented memory, and it’s solved by heap compaction by introducing a runtime latency. But the biggest problem with real memory is that people allocate many big chunks where few small ones could be used, and probably would be used if memory cost was something above zero. Can you think of a real-world analogy for that?..)

Why not price memory the way we price cycles?

I’d very much like to find a way to price memory - both instructions and data - the way we naturally price cycles. It’d be nice to have organizational pressure mount proportionately to the amount of memory spent.

But I just don’t quite see how to do it, except in environments where it happens naturally. For instance, on a server farm, larger memory footprint can mean that you need more servers - pressure naturally mounts to reduce the footprint. Not so on a dedicated PC or an embedded device.

Why isn’t parking like fuel, for that matter? Why are there so many places where you’d expect to find huge underground parking lots - everybody wants to park there - but instead find parking shortages? Why doesn’t the price of parking spots rise as spots become taken, at least where I live?

Well, basically, fuel is not parking - you can transport fuel but not parking spots, for example, so it’s a different kind of competition - and then we treat them differently for whatever social reason. I’m not going to dwell on fuel vs parking - it’s my analogy, not my subject. But, just as an example, it’s perfectly possible to establish fuel price controls and get fuel shortages, and then fuel becomes like parking, in a bad way. Likewise, you could implement dynamic pricing of parking spots - more easily with today’s technology than, say, 50 years ago.

Back to cycles vs memory - you could, in theory, “start worrying” long before you’re out of memory, seeing that memory consumption increases. It’s just not how worrying works, though. If you have 1G of memory, everybody knows that you can ship the program when it consumes 950M as easily as when it consumes 250M. Developers just shrug and move along. With speed, you genuinely start worrying when it starts dropping, because both you and the users notice the drop - even if the program is “still usable”.

It’s pretty hard to create “artificial worries”. Maybe it’s a cultural thing - maybe some organizations more easily believe in goals set by management than others. If a manager says, “reduce memory consumption”, do you say “Yes, sir!” - or do you say, “Are you serious? We have 100M available - but features X, Y and Z are not implemented and users want them!”

Do you seriously fight to achieve nominal goals, or do you only care about the ultimate goals of the project? Does management reward you for achieving nominal goals, or does it ultimately only care about real goals?

If the organization believes in nominal goals, then it can actually start optimizing memory consumption long before it runs out of memory - but sincerely believing in nominal goals is dangerous. There’s something very healthy in a culture skeptical about anything that sounds good but clearly isn’t the most important and urgent thing to do. Without that skepticism, it’s easy to get off track.

How would you go about creating a “memory-consumption-aware culture”? I can think of nothing except paying per byte saved - but, while it sounds like a good direction with parking spots, with developers it could become quite a perverse incentive…

Coding standards: is consistency prettier than freedom?

Different projects have different coding standards, and some have none at all. How does it affect the quality of code and developers’ well-being? What results can we reasonably expect from a style guide?

Let’s have a look at the effect of style guides in the real world. Here’s how Jerusalem looks like:

These similarly looking buildings are near the city center. Here’s a shot of the suburbs:

Same stuff, pretty much. White buildings, red roof tiles - or plain flat roofs.

And now for something completely different:

This is Tel Aviv. Buildings don’t look similar to each other here. Nor do different parts of the city:

As you can see, in Tel Aviv, there’s no style guide - everyone builds stuff to suit their own taste.

In Jerusalem, on the other hand, buildings have to be covered with Jerusalem stone, giving them their trademark off-white color. Jerusalem owes its visual consistency to a century-old style guide enforced by municipal laws.

Here are a few observations - relevant to most style guides, I think:

  • Consistent style is either enforced or lacking. Whatever virtues freedom may have, consistency of style is not one of them.
  • Consistent style is functionally inconsequential. Buildings in Jerusalem are about as safe and comfortable as buildings in Tel Aviv.
  • Psychologically, style does matter. Many people hate visiting Tel Aviv because it’s ugly.
  • Whether consistent style is more beautiful is debatable. Many other people hate visiting Jerusalem because it’s ugly.
  • People will defeat stylistic consistency despite the style guide. Here’s an example from one of Jerusalem’s suburbs, Ramot Polin - “a housing project for honeybees“:

This is consistent with the style guide, but not very consistent with the actual look of other buildings - nor does it look very comfortable. Leading to my last observation:

  • A common style can be codified and enforced, but common sense can’t be. Municipal law mentioned “off-white”, but who would have thought to mention “rectangular”?

A sensible style guide is your one and only way to achieve consistent style - and not much else.

What if a style guide is not sensible? Here’s a building from Tirana, Albania:

Here’s another one:

Yep, that’s the style guide over there - bright colors over ugly buildings. And there’s nowhere to hide from the consistent style.

Maybe you actually love this style, and hate Jerusalem’s uniform off-white. My point is that either way, the consistent style of one of those cities leaves no place for you to like.

Tel Aviv, on the other hand, has a place to like for both Tirana lovers and Jerusalem lovers. Off-white houses with red rooftops? Neve Tzedek has what you want:

Buildings painted in primary colors? Here’s a hotel for you:

Personally, I still prefer Jerusalem though. Consistent style is better - if you like that particular style.

Requiring limestone vs banning asbestos

Can coding standards be described as style guides, or are they more than that?

The Google C++ Style Guide suggests that it is in fact more than a style guide:

The term Style is a bit of a misnomer, since these conventions cover far more than just source file formatting.

The document goes on to say that, apart from “enforcing consistency”, it also “constrains, or even bans, use of certain features” - “to avoid the various common errors and problems that these features can cause”.

“Enforcing consistency” does sound similar to requiring limestone - there’s no direct functional impact. But “banning features to avoid problems” sounds more like banning asbestos - very much because of its functional impact, which can include cancer.

However, language features are different from building materials. Asbestos was discovered, not designed, and they couldn’t know it’d cause cancer. C++ RTTI was designed and approved as a standard by strong programmers, who had in mind some cases where they thought it’d be useful.

RTTI is banned by the Google Style Guide, not the way asbestos is banned by regulations, but the way some sculptors prohibit their students to use fingers when they shape the fine details of clay. Learn to use proper sculpting tools - then do use fingers if necessary:

A query of type during run-time typically means a design problem. …you may use RTTI. But think twice about it. :-) Then think twice again.

Think four times and you’ll be allowed to use RTTI. Think 1024 times and you’re still not allowed to use asbestos in a housing project. That’s because construction standards include functional considerations, but coding standards ultimately discuss style and style alone.

That’s why the strictest coding standards allow exceptions. And that’s why every banned feature is sometimes better than the proposed alternative.

Readability through inconsistency

Style guides enforce consistency. In the real world, we’ve seen that consistent style matters psychologically. In programming, people also advocate consistency for psychological reasons:

It is very important that any programmer be able to look at another’s code and quickly understand it. Creating common, required idioms and patterns makes code much easier to understand.

Psychological reasons are important - but there are symmetric psychological arguments for inconsistency.

For example, required idioms can in fact make code easier to understand - or harder. Let’s look at idioms actually required in some style guides:

  • The use of C++ “algorithms” such as std::for_each and std::transform instead of decade-old “patterns” called loops. I expect the idea to become widespread again, together with C++11 lambdas. Here’s TheRegister’s take on the impact on readability.
  • Yoda conditions: if(5 == num). This page - first Google hit for “Yoda conditions” at the time of writing - lists only benefits and no drawbacks, and proposes to add this to your style guide. Will code become more readable though? They’re Yoda conditions! “If num is five” is how you always say it in English (and Hebrew, and Russian). If five is num, read as natural your code will not.

Of course my opinion on the readability of these patterns is debatable - which is precisely my point. Once a style guide is chosen, some people will experience the joy of fluent reading every time they hit if(5==num). Others will experience the pain of a mental roadblock - also every time.

A style guide will have something to dislike for everyone. When tastes are sufficiently different, the average amount of cringes per person stays the same under consistent style - and the variance rises (someone will hate a particularly popular mandatory pattern).

It’s like keeping wealth constant and increasing inequality - something not even a political party would advocate. How is this psychologically a win?

But let’s go ahead and assume that the “required idioms” suit everyone’s taste, and, by themselves, actually make code easier to understand. Still:

  • External libraries will not follow your style guide. They follow style guides of their own. And this inconsistency can improve readability. Code using the library stands out, and the library’s style can match the accepted domain-specific conventions better than your local style. In computer vision, X is the real world coordinate, x is the pixel coordinate - contrary to many software style guides.
  • You can’t count on stylistic conventions, because there are exceptions. Google’s code orders parameters such that inputs come first, but memcpy & snprintf don’t. You either have to look out for exceptions or risk misunderstandings by blindly assuming consistency.
  • Different people think differently, even if their code looks the same. I find it easier to understand programmers’ intents through their unique style. When they’re all forced to write superficially similarly, I can’t tell who wrote what, and what the subtext of the code is.

I’ll illustrate the last point with a couple of examples. I knew O.M. before I ever saw him and before I even knew his name. To me, he was the programmer with the two spaces before the trailing const:

inline int x()  const;

I knew him through his code: mathematically elegant, obsessive about fine details of type-based binding and modeling. I could guess what he left out with an intent to maybe add it later. I understood him.

Likewise, I can always spot G.D.’s code by the right-leaning asterisk:

int *p,*q=arr+i;

G.D. certainly couldn’t care less about types - similarly to most people with this asterisk alignment. I know his code: terse, efficient, to the point. I know what to expect.

Who wrote this code?

camelCaseName = longerCamelCaseName-camelCaseName;

I dunno, the collective unconscious wrote it. Anyone could write it - or several people patching after each other. I fail to identify with the author and guess his intent - it could be anyone. The code has no smell or taste to me.

This can sound a bit crazy - “does he actually advocate that everyone develops as uncommon style as possible as a way to mark his trail”? Of course I don’t mean that.

I think what I’m saying can make more sense if you think of style and taste in code as analogous to the taste of food. Of course it’s ridiculous to expect every restaurant to make food with a unique taste. Many people like pizza, many people know how to make pizza - so expect much similarly-tasting pizza around.

But we wouldn’t like to always eat food cooked to the same spec by restaurants in a single franchise. If someone knows to make food with a unique taste, we welcome it.

And unique taste of your food doesn’t indicate that it’s bad for your health. Moreover, food with a familiar taste can be very unhealthy - much of it is.  Code in a familiar style has a comforting look - sometimes misleadingly.

I don’t think requiring the same taste everywhere is how you improve the health of a code base.

Popular demand

I’m naturally inclined to argue against coding standards, because it feels like bureaucracy unthinkingly imposed on the work process from above. However, what if it’s not imposed from above - what if the programmers themselves want it?

Many certainly do. Many good ones do.

Incidentally, while I was writing this, I stumbled upon an article arguing for consistency - for the very reasons I use to argue against it:

If code isn’t written in a consistent style in your team, whenever you come across code with the spacing a bit wrong, the first thing your head’s going to process is “I didn’t write this.

…Exactly why I like personal style! I really didn’t write it - it’s his code, I want to understand him, and his style helps greatly.

This is a natural feeling, and as we all know coders have a hard [time] to restrain impulse to rewrite any piece of code they didn’t write.

I understand that impulse very well. Personally, I hate new food more than anyone I know - I eat the same thing every day, for months, for years. I do prefer Jerusalem to Tel Aviv. And I like consistent style - especially my own.

But why should others be forced-fed my favorite cabbage salad? Is consistency that much prettier than freedom?

Team spirit

I don’t argue with people who favor a consistent style - and I can’t. The article above is nostalgic about a team that followed a style guide “religiously”. How can nostalgia be refuted? Clearly, consistent style can create a unique team spirit that to some is valuable and memorable.

All I can and do argue is that a lack of a style guide can also create a good atmosphere that is preferred by some other people.

In fact I believe it is exactly team spirit that a style guide - or lack thereof - actually affects. All other effects come indirectly through the impact on atmosphere. Here’s my attempt at taking the social effect into account:

  • I won’t break established conventions. Following conventions is a great way to say, “I respect the local tradition and the wisdom it embodies.” And I’ll thoroughly enjoy the limestone covering our code - I like consistency. Yay, a beautiful code base! And if the convention is really bad - I hopefully won’t need to join the team in the first place.
  • However, I won’t establish and enforce conventions when I’m the one starting with a clean slate. Having no conventions is a great way to say, “join my project to express yourself without artificial constraints”.

From team spirit to grassroots bureaucracy

Did you know Ken Thompson is not allowed to check in code at Google? He said so in his Coders at Work interview:

Seibel: I know Google has a policy where every new employee has to get checked out on languages before they’re allowed to check code in. Which means you had to get checked out on C.

Thompson: Yeah, I haven’t been.

Seibel: You haven’t been! You’re not allowed to check in code?

Thompson: I’m not allowed to check in code, no.

The programmer who co-created Unix is not allowed to check in code. If this isn’t a bureaucracy, what is? But it’s inevitable - with rules, you always paint yourself into a corner that way. Allow some to break the conventions, and you will have offended everyone else. Use the same rules for everyone - and some won’t contribute.

Of course, Thompson does contribute to Google. Well, Google has plenty of ways to motivate him to do that. And, apparently, they have good programmers willing to write production code based on his uncommitted prototype code.

But we’re not all like Google that way. We can’t all afford the inevitable laughable outcomes of bureaucracies. If you come across an original programmer with an off-beat style, do you want him to join your project or to move on?

Grassroots bureaucracy is still bureaucracy. I wouldn’t object to an established bureaucracy that people claim to value. But I wouldn’t establish a new one, either.

Conclusion

A style guide can make code look prettier to some - and uglier to others - but not tangibly better, except if programmers enjoy it so much as to produce better code.

This is equally true for the lack of a style guide. The results of freedom look prettier to some and uglier to others.

Personally, I believe that one rule too few is better than one rule too many, so I don’t bother to enforce a common style.

P.S.: when I become a neat-freak

Generally, I tend to enforce little to no conventions. Here are the exceptional situations where I actually enter the ridiculous position of telling people how to write their code.

Interfaces should be consistent

I don’t care if an interface looksLikeThis or like_this. But if it uses both, then I’ll ask the author to change it to one of the styles - the one which came first. For users to feel confident, an interface should look well thought-out - which implies an illusion of a single author, which implies a consistent style.

By “interface”, I only mean the outermost stuff called by module users. Internal functions, classes, etc. can look like Tel-Aviv as far as I’m concerned. For instance, in a simple server, the “interface” to me is just the protocol, and nothing in the code itself.

Warnings should be errors

I hate the concept of compiler warnings - I link it to the concept of guilt. “Fine, be that way, do this evil implicit conversion thing - but if something happens, I will have told you so.” Why should we put up with such manipulative behavior? Pick a position - refuse to compile, or compile silently.

However, in practice, compatibility issues make what has to be errors into warnings. If it compiled 30 years ago, it has to keep compiling, even if nobody wants it to compile in new code. Even if compilers could always prove the code wrong, but initially didn’t bother, and it happens to work in old programs.

So, to the dismay of freedom-lovers, I turn warnings into errors where I can (as in -Werror) - even if I can’t cherry-pick the “right” warnings. Mainly since when a file generates 10 (false) warnings, the eleventh (truly useful) warning goes unnoticed.

Another reason is that warnings cause guilt - they are evil, and must be destroyed. If I didn’t destroy them by turning them into errors, I’d have to destroy them by disabling them. Then I’d never get that eleventh useful warning.

But except for these two pet peeves - interfaces and warnings - I do think grown-up programmers should be left alone.

P.P.S.: Greetings from the Overextended Metaphor Parrot

Originally, I had a few references to Bauhaus architecture in the text - how Tel Aviv has buildings in the Bauhaus style and Jerusalem doesn’t because of its limestone requirement, and how Ken Thompson’s inability to commit code at Google is analogous to that. However, as a commenter pointed out, there are buildings in the Bauhaus style in Jerusalem - in my own neighborhood, actually, so I obviously walked past them plenty of times.

I guess this goes to show that good architects have no trouble complying with a style guide - and that I shouldn’t overextend metaphors in areas where I’m not minimally competent.

Machine code monkey patching

A monkey patch is a way to extend or modify the runtime code of dynamic languages (e.g. Smalltalk, JavaScript, Objective-C, Ruby, Perl, Python, Groovy, etc.) without altering the original source code.

Wikipedia

For example, the Python code:

# someone else’s class
class TheirClass:
 def their_method(self):
  print “them”
obj = TheirClass()
obj.their_method()

# our function
def our_function(self):
 print “us”

# the monkey patch
TheirClass.their_method = our_function
obj.their_method()

…will print:

them
us

…showing that we have changed the behavior of TheirClass objects, including those we didn’t create ourselves. Which can’t be done with more civilized techniques like inheritance.

Here’s how you can monkey patch machine code, assuming the machine architecture is ARM:

typedef void (*funcptr)();

void monkey_patch(funcptr their_func, funcptr our_func) {
  ((int*)their_func)[0] = 0xe51ff004;
  ((int*)their_func)[1] = (int)our_func;
}
//monkey patching the memory allocator:
monkey_patch((funcptr)&malloc, (funcptr)&our_malloc);
monkey_patch((funcptr)&free, (funcptr)&our_free);

This overwrites the first instruction (32-bit word) of their_func with 0xe51ff004, which is the ARM machine code corresponding to the assembly instruction LDR PC,[PC,-4] - which means, in C-like pseudocode, PC = *(PC+4), or “jump to the program location pointed by the next word after the current program location”.

(Why the byte address PC+4 is spelled in assembly as PC-4? I recall that it’s because an ARM instruction at address X actually gets the value X+8 when referencing PC. Presumably because it is - or at some point was - the most convenient semantics for pipelined hardware to implement:

  • when the instruction at address X executes,
  • the instruction at address X+4 is decoded, and
  • the instruction at address X+8 is fetched

- so the physical PC register could very well keep the value X+8.)

So the first word of their_func is overwritten with, “jump to where the second word points”. The second word is then overwritten with our_func, and we’re all set.

Purpose

I actually did this in production code, on a bare metal target (no OS - just a boot loader that runs a massive single binary). I monkey patched the memory allocator - malloc, free, calloc, realloc - and the Unix-like I/O functions underlying that particular compiler’s <stdio.h> and <iostream> implementation - read, write, open, close, creat. The memory allocator had to be changed to work on the target dual-core chip. The I/O functions had to be changed to use our drivers, so that we could write stuff to the Flash or USB using FILE* or ofstream.

A more civilized approach, if you want to override functions in a dynamic library, is passing another library at run time with LD_PRELOAD or equivalent. And if the code is linked statically as it was in my case, you can override the functions at link time. The trouble is that the linker could refuse to cooperate.

(And in my case, we shipped libraries, the customer linked the program, and the guy who talked to the customer refused to cooperate - that is, to help them override functions at link time. He was an old-school embedded developer, the kind that don’t need no stinking malloc and printf. The project had a million lines of code very much in need of malloc and printf. He said, clean it up. Don’t call malloc on the second CPU. So I went away and monkey patched malloc anyway.

In such a case, the civilized approach is to keep trying to talk the guy into it, and then have him persuade the (even more hardcore) embedded devs at the customer’s side. What I did was what managers call “an attempt at a technical solution when a social solution is needed”. Or as programmers call it, “avoiding a couple of months of pointless discussions”. Being neither a full-time programmer nor a full-time manager, I don’t have a clear opinion which viewpoint is right. I guess it depends on how long and pointless the discussions are going to be, versus how long and pointless the code working around the “social” problem will be.)

In theory, machine code monkey patching could be used in a bunch of “legitimate” cases, such as logging or debugging. In practice, this ugly thing is probably only justified in inherently ugly situations - as is kinda true of monkey patching in general.

Pitfalls

My example implementation for the ARM assumes that a function has at least 2 instructions. An empty ARM assembly function can have just one (jump to link register). In that case, the first instruction of the next function will be overwritten. A more sophisticated version of monkey_patch() could stash the target address someplace else, and use a LDR PC,[PC,clever_offset] command instead of a constant LDR PC,[PC,-4] command.

Overwriting machine code instructions breaks code that reads (as opposed to “executes”) those instructions, counting on the original bit patterns to be stored there. This isn’t very likely to be a problem with actual ARM code, unless it was written by Mel.

On any machine with separate and unsynchronized instruction and data caches, overwriting instructions will modify the contents of the data cache but not the instruction cache. If the instructions happen to be loaded to the instruction cache at the time of overwriting, subsequent calls to the monkey-patched function might call the original function, until the instruction cache line keeping the original code happens to be evicted (which isn’t guaranteed to ever happen).

If your luck is particularly bad and the two overwritten instructions map to two adjacent cache lines, only one of which is loaded to the instruction cache at the time of overwriting, a call to the monkey-patched function might crash (since it’ll see one original instruction word and one new one). In any case, on machines where caches won’t sync automatically, one should sync them explicitly to implement self-modifying code correctly (I’ll spare you my ARM9 code doing this).

If your OS places instructions in read-only memory pages, overwriting it will not work unless you convince the OS to grant you permissions to do so.

C++

C++ virtual functions can be monkey patched more similarly to the typical dynamic language way. Instead of modifying instructions, we can overwrite the virtual function table.

Advantages:

  • more portable across machine architectures - the vtable layout doesn’t depend on the machine
  • no cache syncing problems
  • no encoding-related corner cases like very short functions or instructions used as data

Disadvantages:

  • less portable across compilers - ARM bytecode is the same with all compilers, vtable layout is not
  • fewer calls could be redirected - some compilers avoid the vtable indirection when they know the object’s type at compile time (of course inlined calls won’t be redirected with either technique)
  • only virtual functions can be redirected - typically a minority of C++ class member functions

The need to fiddle with OS memory protection is likely to remain since vtables are treated as constant data and as such are typically placed in write-protected sections.

Example C++ code (g++/Linux, tested with g++ 4.2.4 on Ubuntu 8.04):

#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

template<class T, class F>
void monkey_patch(int their_ind, F our_func) {
  T obj; //can we get the vptr without making an object?
  int* vptr = *(int**)&obj;
  //align to page size:
  void* page = (void*)(int(vptr) & ~(getpagesize()-1));
  //make the page with the vtable writable
  if(mprotect(page, getpagesize(), PROT_WRITE|PROT_READ|PROT_EXEC))
    perror(”mprotect”);
  vptr[their_ind] = (int)our_func;
}

class TheirClass {
 public:
  virtual void some_method() {}
  virtual void their_method() { printf(”them\n”); }
};
void our_function() { printf(”us\n”); }
int main() {
  TheirClass* obj = new TheirClass;
  //gcc ignores the vtable with a stack-allocated object
  obj->their_method(); //prints “them”

  monkey_patch<TheirClass, void(*)()>(1, our_function);
  //some_method is at index 0, their_method is at 1
  //we could instead try to non-portably get the index
  //out of &TheirClass::their_method

  obj->their_method(); //prints “us”
}

Conclusion

Let’s drink to never having to do any of this (despite the fact that yes, some of us do enjoy it in a perverted way and feel nostalgic blogging about it).

Making data races manifest themselves

Or, synchronization, valgrind and poset dimension.

This story is perhaps more technical than I ever allowed myself to get online. If I may say so - it may be worth following; I can say that it really changed my perspective on what data races are.

I’ll need to explain a bit about the context:

  • how we structure our parallel apps for multi-core targets
  • how we catch synchronization bugs using a custom valgrind tool
  • which of the bugs were hard to catch that way

Then I can explain how we now catch them all (at least those having any chance to happen on our test inputs; you’ll see what I mean).

I’ll also discuss valgrind’s standard race detection tool, helgrind. I’ll try to show that the interface to parallelism it works against - pthreads - poses a much harder problem, and how races can go undetected because of that.

I hope this can be interesting outside of my specific work context because “catching all the races” seems to be a claim rarely made. Also we got to stumble upon a well-known math problem during mundane pointer-chasing - a cliche of sorts, but amusing when it actually happens.

So, here goes.

Our parallelism frameworkThe framework - goals & dependencies

…known in its current incarnation as GSF - “Goal/Graph Scheduling Framework”. It’s fairly simple - 4 paragraphs and 2 pseudo-code listings should be a reasonable introduction.

The idea is, your app is a graph of goals. The nodes, goals, are things to do - typically function pointers (could be function objects or closures). The edges are dependencies - A depends on B, so A must be executed after B (denoted with B A in the diagrams). The main loop does this:

goal = first_goal # a dummy everything depends on
while goal != last_goal: # a dummy, depends on everything
  update_dependent_goals_status(goal) # find READY goals
  for chosen in sched.choose_goals(): # scheduler
    chosen.resource.enqueue(chosen) # outgoing message Qs
  goal = dequeue_finished_goal() # incoming message Q

This runs all the goals in a graph, and can itself be invoked in a loop - in our case, we run all the goals for every grabbed video frame.

Whatever the scheduling policy, a correct execution order will be chosen. Only READY goals can run, and a goal only becomes READY when everything it depends on becomes DONE. This is implemented in update_dependent_goals_status:

goal.status = DONE
for dep in goal.dependent_goals:
  dep.num_predecessors -= 1
  if dep.num_predecessors == 0:
    dep.status = READY

The scheduling policy is implemented partly in sched.choose_goals - the scheduler, and partly in resource.enqueue - the queues. The scheduler decides who is enqueued first; some queues may reorder waiting goals by priority, or even suspend the running goal(s) when a more important one arrives. While the scheduling policy can be made complicated almost without bound, that’s all there is to the framework itself.

To summarize in standard terms, it’s a kind of a message passing framework, without explicit threads and without locking, but with (massive) data sharing. “Goal scheduled”/”goal finished” notifications are the only messages, and only used by the framework - the goals themselves always communicate through shared data. Dependencies are the only synchronization primitive. OS locks are never used, except in the guts of various memory allocators.

Drawbacks

Users, or “application programmers”, usually hate this framework with a passion, because code is hard to read and modify.

To break your app into the sort of graph described above, you need to specify the goals and the dependencies. Currently we do both in ugly ways. Goals are specified as global functions, with context passed through global variables: void goal_A() { g_ctxt… }. The hairy dependency graph is constructed by a Python script running at build time, with ifs, loops, and function calls - if do_A: make_goal(”A”,dep=”B C”). This adds ugliness to C++ code, and then makes you follow ugly Python code to figure out the execution order of the ugly C++ code.

Some of it was even uglier in the past. Some of it could become less ugly in the future. At least one serious limitation is irremediable - the inability to create goals at run time. For example, you can’t say “create a goal per processed data item” - rather, you need to explicitly, statically create a goal per CPU and distribute the data between the goals. You can’t say “if X happens, create that goal” - you need to always have that goal, and have it do nothing unless X happens. People don’t like to look at their flow that way.

Motivation

Efficiency and correctness, basically. Efficiency is tangential to our story, so I won’t discuss it (and frankly, we aren’t sure how much efficiency we gain, compared to other approaches). As to correctness - the hateful thing indeed “scales”, which is to say, it doesn’t break down. We’ve managed to ship working code for several years, given:

  • 8 to 10 heterogeneous target cores
  • Hundreds of goals
  • Dozens of applications
  • Dozens of developers

…the worst thing being “dozens of developers” - most unaware of parallelism, as I think they should be. Not that it proves we couldn’t have done it differently - I can only give my theory as to why it would be harder.

The biggest problem with parallel imperative programs is synchronizing access to shared data. Tasks/threads can, and are encouraged to communicate through messages, but eliminating sharing is hard:

  • People frequently don’t even realize they’re sharing data

  • If tools make implicit use of shared data impossible, things can get painful (think of explicitly passing around a particularly “popular” data item, or copying large/intertwined objects)

Sharing data is natural and efficient in imperative programs, and outlawing it is hard. And locks are problematic since

  • You can have deadlocks, and
  • You can forget to lock.

Now, if you schedule a statically known flow graph rather than an unconstrained threaded app, it turns out that you solve both of these problems:

  • Deadlocks are trivially prevented - at build time, check that the dependency graph has no cycles

  • Undeclared dependencies - sharing data without proper synchronization - can be found using program instrumentation

Our subject is this second bit - using program instrumentation to find race conditions (which makes everything up to this point an unusually long introduction).

We use a custom valgrind plug-in (”tool” as they call it) for program instrumentation. It works somewhat similarly to valgrind’s helgrind tool, though helgrind’s implementation is much more complicated.

However, helgrind has false negatives - it will consistently miss some of the data races, as I believe will any tool lacking the knowledge of the overall app structure. A simple example of a data race unreported by helgrind appears after the discussion on race condition detection, when the problem should become more clear.

Race conditions in a static graph

In an imperative program, goals communicate through memory. A memory access can thus be thought of as a proof of dependence. If goal A accesses memory at address X, and goal B was the last to modify X, then A depends on B. If there is no path from A to B in the dependency graph, the program is buggy. A could run before B or after it, so the result of accessing X is undefined.Write then read: the simplest dependence proof

Suppose we could intercept all load/store instructions in the program. Then we could maintain, for every byte, its current “owner” - a goal ID. A store to a location would update its owner to the ID of the currently running goal. Loads and stores could then check whether the currently running goal depends on the owner of the accessed location - all it takes is accessing a 2D array, the path matrix. Upon error, print the call stack, and the names of the 2 goals.

This sort of thing is surprisingly easy to implement on top of valgrind, almost as easy as implementing an interface with abstract onLoad and onStore methods. I thought of posting sample code but it looks like the example tool shipped with valgrind, “lackey“, comes with a load/store interception example these days.

As to the “shadow memory” you need for keeping owner IDs - you can do that with a page table, along the lines of a virtual memory implementation in an OS. The high bits of a pointer select a page and the low bits are the offset within the page, with pages allocated upon first access. Standard valgrind tools do it this way, too.

Our valgrind tool is called shmemcheck for “shared memory checker”. A lame pun on memcheck, valgrind’s most famous tool, the one reporting uninitialized memory access. “Memcheck-shmemcheck”. Probably the last time I used this sort of name - funny for the first few times, embarrassing for the next few thousands of times.

Despite the embarrassment, when we got an implementation good enough to be systematically used, it was really great - data races were decimated. Valgrind is awesome, unbelievably so. It managed to unimaginably expand what can be done to a natively compiled program, decades after the tools for building native programs were cast in stone.

The trouble with this version of, cough, shmemcheck, is that it doesn’t really work. That is, sometimes you have false negatives, so you still get to dive into core dumps. Why?

What about the opposite case?

Read then write: another shadow cell?If A loads from X that was last modified by B, A depends on B, and we detect it alright. What if A writes to X that was previously read by B? This also proves that A should depend on B. Otherwise B will sometimes run after A, reading its update to X, and sometimes before A, reading the previous value. In order to detect this dependency, we have to remember that X was read by B until A stores to X.

…What if, in addition to B, X was read by C, D, and E, all before A’s update?Reads then write: too many shadow cells

Every location always has one “owner” but can have many “users”. We can afford keeping an ID - 2 bytes in our case - for every byte. Keeping - and checking - a list of users that could grow to tens or even hundreds of IDs per location sounds impractical.

We used all sorts of application-specific arguments to convince ourselves that this problem is confined to a few particular scenarios. We had a few hacks removing some of the false negatives, at the cost of adding some false positives, and lived with that.

Choosing the order

We got used to think that the question was, how do you know who reads X before A? But then it dawned upon us that the right question was: why do they read X before A?!read then write: why not reverse the order?

And really - if B, C and D run before A because it is stated in the dependency graph that A depends on B, C and D - then there’s no bug to worry about. But if there’s no dependency declared between, say, A and B, then A could run before B just as well - and we’d detect the bug. So we only missed the race condition because of a randomly chosen order. And when we do find races, we don’t know if B depends on A or vice versa - only that we were lucky to have A run first.

What if we choose different random orders and run the same app many times? Then in some run, A will run before B and we’ll find the bug - with just one shadow cell keeping the owner.

…Or will it? We have in our dependency graph many “A, B” pairs of independent goals. If we pick random orders, how likely are these pairs to “flip” to “B, A” in some orders but not others?

The first conjecture was - very likely. Just randomize schedules by having sched.choose_goals pick a random goal among the READY ones. “There seems to be no bias in the selection; A and B are independent so the chance for A to come before B is 1/2; therefore, with N orders, the chance for A to always follow B is 1/2^N - small enough.”

Interestingly, it still sounds reasonable to me, though I found a counter-example, just out of fear that it’s too good to be true (what, years of core dumps are now behind me?) The counter example is, suppose you have two long processes, P and Q, each made up of 100 goals, P1…P100 and Q1…Q100. P and Q are declared independent - Pi never depends on Qj or vice versa. However, Pi+1 depends on Pi, similarly for Q, so P and Q goals can run in just one order.

Just two schedules would “flip” all independent pairs: (1) Run P, then Q and (2) Run Q, then P. However, sched.choose_goals has a snowflake’s chance in hell to pick these schedules. The set of READY goals will contain, at all times, 2 goals: one from P and one from Q. sched.choose_goals must either choose the one from P 100 times in a row, or the one from Q 100 times in a row. Since that’s a 1 in a 2^100 event, P1 will always run before Q100. If P1 loads from X and Q100 stores to X, we’ll never find the undeclared dependency.

One could say that things that aren’t likely to flip in such random trials are unlikely to flip in reality and it’d be 100% wrong. In reality, execution order depends on real run times and those create a very biased, and an unpredictably biased sampling of the distribution of schedules.

Now, with N goals, N*2 orders would clearly be enough to flip all pairs - for every goal, create an order where it preceeds everything it doesn’t depend on, and another order where it follows everything it doesn’t depend on. But our N is above 1000, so N*2, though not entirely impractical, is a bit too much.

Poset dimension

Our pessimism resulting from the collapse of the first conjecture lead to an outburst of optimism yielding a second conjecture - just 2 orders are enough to flip all pairs. Specifically, the orders we thought of were a DFS postorder and its “reversal”, where you reverse the order in which you traverse the children of nodes.

Works on my P & Q example indeed, but there’s still a counter-example, just a bit more elaborate. We kept looking at it until it turned out that we shouldn’t. The dependency graph makes the goals a partially ordered set or poset. Finding the minimal number of schedules needed to flip all independent pairs would find the order dimension of that poset. Which is a known problem, and, like all the good things in life, is NP-hard.

Curious as this discovery was, I was crushed by this NP-hardship. However, there still could be a good practical heuristic yielding a reasonably small number of orders.

We tried a straightforward one - along the lines of the N*2 upper bound, “flip them one by one”, but greedier. Simply try to flip as many pairs as you can in the same schedule:

  1. Start with a random schedule, and make a list of all the pairs that can be flipped.
  2. Starting with the real dependencies, add fake dependencies forcing pairs from the list to flip, until you can’t add any more without creating a cycle.
  3. Create a schedule given those extended dependencies.
  4. Remove the pairs you flipped from the list. If it isn’t empty, go to step 2.

This tended to produce only about 10 schedules for more than 1000 goals, and there was much rejoicing.

Massive testing

Instrumentation slows things down plenty, so you can’t run many tests. This is a pity because instrumentation only catches data races that actually occurred - it’s actual loads by A from location X owned by B that prove that A depends on B. But if you can’t run the program on a lot of data, you may not reach the flow path in A that uses X.

However, if you can, with just 10 schedules, flip all “A, B” pairs, you don’t need instrumentation in the first place. Just run the app with different orders on the same data, see if you get the same results. If you don’t, there’s a race condition - then run under instrumentation to have it pin-pointed to you. The order is “purposefully diversified”, but deterministic, so problems will reproduce in the second, instrumented run. Thus massive runs can be uninstrumented, and therefore, more massive.

So not only does this cheap poset dimension upper bound business solve the false negatives problem with instrumentation - it also greatly increases coverage. Of course there can still be a race that never occurred in any test runs. But now the quality of testing is defined just by the quality of the test data, never by chance. With races, whether you detect them or not is thought of as something inherently having to do with chance. It turns out that it doesn’t have to.

Helgrind’s problem

Helgrind - or any tool lacking knowledge about the app structure - can not meaningfully reorder things this way, in order to “make races manifest themselves”. A race can be detected if A that stores to X is moved to happen before B that loads from X, but in order to do that, you have to know where A and B are.

Helgrind’s only markers breaking the app into meaningful parts are synchronization calls, like mutex locks/unlocks. But it doesn’t know where they’ll happen in advance. So it can’t stir the execution so that “this lock happens as early as possible” - there’s no “this lock” until a lock actually happens. By then it’s too late to make it happen before anything else that already happened. (In theory, the app structure could be inferred dynamically from the flow, I just don’t see how I’d go about it.)

Therefore, helgrind seems to have the same problem our first versions of shmemcheck had. You can keep the single owner of a location but not its many readers; it appears that helgrind keeps one reader - the last. Helgrind’s “owner/reader” is a lock ID rather than a goal ID, but I think the principle is very similar.  I haven’t studied helgrind’s interals so I only speculate about how it works, but I easily managed to create an example where it gives a false negative:

  1. In thread A, read X without locking, as many times as you want (…or don’t want - in a real program it would be a bug, not an achievement…)
  2. After that, read X once with proper locking.
  3. In thread B, modify X, again with proper locking.

If, during testing, thread B happens to modify X after all of thread A’s reads from X, the bug in A - unsynchronized access to X - will go unnoticed. It won’t go unnoticed if thread A never locks properly - since helgrind will remember and report the last of its unsynchronized reads:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int shared=1;
int unused=0;
typedef struct { int read_unlocked, read_locked; } sopt;

void* child_fn(void* arg) {
  sopt* opt = (sopt*)arg;
  if(opt->read_unlocked) {
    unused += shared;
  }
  if(opt->read_locked) {
    pthread_mutex_lock(&mutex);
    unused += shared;
    pthread_mutex_unlock(&mutex);
  }
  return 0;
}

void testhelgrind(int read_unlocked, int read_locked,
                   const char* expected) {
  fprintf(stderr,“expected behavior: %s\n”, expected);
  //fork
  sopt opt = { read_unlocked, read_locked };
  pthread_t child;
  pthread_create(&child, 0, child_fn, &opt);
  //lock & modify shared data
  pthread_mutex_lock(&mutex);
  shared = 2;
  pthread_mutex_unlock(&mutex);
  //join
  pthread_join(child, 0);
}

int main(void) {
 testhelgrind(0,1,“no errors reported [true negative]“);
 testhelgrind(1,0,“data race reported [true positive]“);
 testhelgrind(1,1,“no errors reported [false negative]“);
}

Which, under valgrind –tool=helgrind, prints:

expected behavior: no errors reported [true negative]
expected behavior: data race reported [true positive]

…
==15041== Possible data race during write of size 4 at 0×804a024 by thread #1
==15041==    at 0×8048613: test_helgrind (helgrind_example.c:33)
==15041==    by 0×8048686: main (helgrind_example.c:41)
==15041==  This conflicts with a previous read of size 4 by thread #3
==15041==    at 0×804856E: child_fn (helgrind_example.c:14)
…

expected behavior: no errors reported [false negative]

…or at least it works that way on my system, where the child thread runs first.

So if helgrind does report a data race - be sure to look careful. If you have several buggy reads, and only fix the last one and run again, helgrind will very likely not report the previous one. Your new synchronization code around the last read now “masks” the previous read.

The happy ending

We still have false negatives in our data race detection, but these are just easy-to-fix low-level loose ends - or so I think/hope. Things like tracking memory writes by devices other than CPUs, or handling custom memory allocators. I’m looking forward to have these false negatives eliminated, very sincerely. Few things are uglier than debugging data races.

I never thought this through enough to tell whether this purposeful reordering business can be utilized by a framework more restrictive than “use raw threads & locks”, but less restrictive than “your app is a static graph”. If it can be (is already?) utilized elsewhere, I’d be very curious to know.

Special thanks to Yonatan Bilu for his help with the math.

The Iron Fist Coding Standard

I’ve developed the following coding standard during the years when I’ve been responsible, on and off, for reviewing code by other programmers. It’s suitable for most text-based programming languages and data specification formats.

Rules

Indent upon nesting

Where control structures or data objects nest, indentation helps a reader to keep track of the nesting level.

Indentation should render properly in all relevant IDE/editor configurations. Nobody should see stuff in the inner scope closer to the left than the outer scope. Seriously, spend 10 minutes to understand tabs vs spaces issues.

Use profanity judiciously

Profanity makes code easier to write, but harder to read and modify. Profanity signals a potential hazard, but distracts attention from the details - the opposite of a good warning. With profanity, less is more.

Be careful with comments, more careful with identifiers, and still more careful with the ones going into symbol tables. Except when mandated by an explicit requirement, under no circumstances should profanity appear in a program’s output, documentation or configuration files, or be a required part of its input.

Clean up

Bits of code that can’t possibly serve a purpose are a sign of neglect, scaring some readers and depressing others. Delete.

Semi-commented-out stuff spread around in moments of panic can be hard to clean up during the bad mood that breeds it. But that bad mood will haunt you as long as you keep bumping into that stuff. So.

Deviations

Concise nested statements or data object specifications can take one line and so need no indentation: if(error) quit.

Indenting upon nesting is neither mandated nor recommended in languages and formats where it is not customary. For example, code following a branch instruction or the <html> opening tag.

It’s nice when generated code is indented, but sometimes the white space uses up bandwidth and sometimes you get fed up with propagating the nesting level through the code generator. Generated code is mostly read by whoever wrote the generator so it’s up to you.

For profanity to appear in a program’s interaction with the user, no explicit requirement is needed when it is a common practice in the given branch of the industry.

Some code won’t be very tidy when you’re really in a hurry - so be it.

Conclusion

Adopting the Iron Fist Coding Standard will give you the benefits of a mature, field-tested system of coding guidelines while saving the cost of creating one from scratch.

My history with Forth & stack machines

My VLSI tools take a chip from conception through testing. Perhaps 500 lines of source code. Cadence, Mentor Graphics do the same, more or less. With how much source/object code?

Chuck Moore, the inventor of Forth

This is a personal account of my experience implementing and using the Forth programming language and the stack machine architecture. “Implementing and using” - in that order, pretty much; a somewhat typical order, as will become apparent.

It will also become clear why, having defined the instruction set of a processor designed to run Forth that went into production, I don’t consider myself a competent Forth programmer (now is the time to warn that my understanding of Forth is just that - my own understanding; wouldn’t count on it too much.)

Why the epigraph about Chuck Moore’s VLSI tools? Because Forth is very radical. Black Square kind of radical. An approach to programming seemingly leaving out most if not all of programming:

…Forth does it differently. There is no syntax, no redundancy, no typing. There are no errors that can be detected. …there are no parentheses. No indentation. No hooks, no compatibility. …No files. No operating system.

Black Square by Kazimir Malevich

I’ve never been a huge fan of suprematism or modernism in general. However, a particular modernist can easily get my attention if he’s a genius in a traditional sense, with superpowers. Say, he memorizes note sheets upon the first brief glance like Shostakovich did.

Now, I’ve seen chip design tools by the likes of Cadence and Mentor Graphics. Astronomically costly licenses. Geological run times. And nobody quite knows what they do. To me, VLSI tools in 500 lines qualify as a superpower, enough to grab my attention.

So, here goes.

***

I was intrigued with Forth ever since I read about it in Bruce Eckel’s book on C++, a 198-something edition; he said there that “extensibility got a bad reputation due to languages like Forth, where a programmer could change everything and effectively create a programming language of his own”. WANT!

A couple of years later, I looked for info on the net, which seemed somewhat scarce. An unusually looking language. Parameters and results passed implicitly on a stack. 2 3 + instead of 2+3. Case-insensitive. Nothing about the extensibility business though.

I thought of nothing better than to dive into the source of an implementation, pForth - and I didn’t need anything better, as my mind was immediately blown away by the following passage right at the top of system.fth, the part of pForth implemented in Forth on top of the C interpreter:

: (   41 word drop ; immediate
( That was the definition for the comment word. )
( Now we can add comments to what we are doing! )

Now. we. can. add. comments. to. what. we. are. doing.

What this does is define a word (Forth’s name for a function) called “(”. “(” is executed at compile time (as directed by IMMEDIATE). It tells the compiler to read bytes from the source file (that’s what the word called, um, WORD is doing), until a “)” - ASCII 41 - is found. Those bytes are then ignored (the pointer to them is removed from the stack with DROP). So effectively, everything inside “( … )” becomes a comment.

Wow. Yeah, you definitely can’t do that in C++. (You can in Lisp but they don’t teach you those parts at school. They teach the pure functional parts, where you can’t do things that you can in C++. Bastards.)

Read some more and…

\ conditional primitives
: IF     ( — f orig )  ?comp compile 0branch  conditional_key >mark     ; immediate
: THEN   ( f orig — )  swap ?condition  >resolve   ; immediate
: BEGIN  ( — f dest )  ?comp conditional_key <mark   ; immediate
: AGAIN  ( f dest — )  compile branch  swap ?condition  <resolve  ; immediate
: UNTIL  ( f dest — )  compile 0branch swap ?condition  <resolve  ; immediate
: AHEAD  ( — f orig )  compile branch   conditional_key >mark     ; immediate

Conditional primitives?! Looks like conditional primitives aren’t - they define them here. This COMPILE BRANCH business modifies the code of a function that uses IF or THEN, at compile time. THEN - one part of the conditional - writes (RESOLVEs) a branch offset to a point in code saved (MARKed) by IF, the other part of the conditional.

It’s as if a conventional program modified the assembly instructions generated from it at compile time. What? How? Who? How do I wrap my mind around this?

Shocked, I read the source of pForth.

Sort of understood how Forth code was represented and interpreted. Code is this array of “execution tokens” - function pointers, numbers and a few built-ins like branches, basically. A Forth interpreter keeps an instruction pointer into this array (ip), a data stack (ds), and a return stack (rs), and does this:

while(true) {
 switch(*ip) {
  //arithmetics (+,-,*…):
  case PLUS: ds.push(ds.pop() + ds.pop()); ++ip;
  //stack manipulation (drop,swap,rot…):
  case DROP: ds.pop(); ++ip;
  //literal numbers (1,2,3…):
  case LITERAL: ds.push(ip[1]); ip+=2;
  //control flow:
  case COND_BRANCH: if(!ds.pop()) ip+=ip[1]; else ip+=2;
  case RETURN: ip = rs.pop();
  //user-defined words: save return address & jump
  default: rs.push(ip+1); ip = *ip;
 }
}

That’s it, pretty much. Similar, say, to the virtual stack machine used to implement Java. One difference is that compiling a Forth program is basically writing to the code array in a WYSIWYG fashion. COMPILE SOMETHING simply appends the address of the word SOMETHING to the end of the code array. So does plain SOMETHING when Forth is compiling rather than interpreting, as it is between a colon and a semicolon, that is, when a word is defined.

So

: DRAW-RECTANGLE 2DUP UP RIGHT DOWN LEFT ;

simply appends {&2dup,&up,&right,&down,&left,RETURN} to the code array. Very straightforward. There are no parameters or declaration/expression syntax as in…

void drawRectangle(int width, int height) {
  up(height);
  right(width);
  down(height);
  left(width);
}

…to make it less than absolutely clear how the source code maps to executable code. “C maps straightforwardly to assembly”? Ha! Forth maps straightforwardly to assembly. Well, to the assembly language of a virtual stack machine, but still. So one can understand how self-modifying code like IF and THEN works.

On the other hand, compared to drawRectangle, it is somewhat unclear what DRAW-RECTANGLE does. What are those 2 values on the top of the stack that 2DUP duplicates before meaningful English names appear in DRAW-RECTANGLE’s definition? This is supposed to be ameliorated by stack comments:

: DRAW-RECTANGLE ( width height — ) … ;

…tells us that DRAW-RECTANGLE expects to find height at the top of the stack, and width right below it.

I went on to sort of understand CREATE/DOES> - a further extension of this compile-time self-modifying code business that you use to “define defining words” (say, CONSTANT, VARIABLE, or CLASS). The CREATE part says what should be done when words (say, class names) are defined by your new defining word. The DOES> part says what should be done when those words are used. For example:

: CONSTANT
   CREATE ,
   DOES> @
;
\ usage example:
7 CONSTANT DAYS-IN-WEEK
DAYS-IN-WEEK 2 + . \ should print 9

CREATE means that every time CONSTANT is called, a name is read from the source file (similarly to what WORD would have done). Then a new word is created with that name (as a colon would have done). This word records the value of HERE - something like sbrk(0), a pointer past the last allocated data item. When the word is executed, it pushes the saved address onto the data stack, then calls the code after DOES>. The code after CREATE can put some data after HERE, making it available later to the DOES> part.

With CONSTANT, the CREATE part just saves its input (in our example, 7) - the comma word does this: *HERE++ = ds.pop(); The DOES> part then fetches the saved number - the @ sign is the fetch word: ds.push( *ds.pop() );

CONSTANT works somewhat similarly to a class, CREATE defining its constructor and DOES> its single method:

class Constant
  def initialize(x) @x=x end
  def does() @x end
end
daysInWeek = Constant.new(7)
print daysInWeek.does() + 2

…But it’s much more compact on all levels.

Another example is defining C-like structs. Stripped down to their bare essentials (and in Forth things tend to be stripped down to their bare essentials), you can say that:

struct Rectangle {
  int width;
  int height;
};

…simply gives 8 (the structure size) a new name Rectangle, and gives 0 and 4 (the members’ offsets) new names, width and height. Here’s one way to implement structs in Forth:

struct
  cell field width
  cell field height
constant rectangle

\ usage example:
\ here CREATE is used just for allocation
create r1 rectangle allot \ r1=HERE; HERE+=8
2 r1 width !
3 r1 height !
: area dup width @ swap height @ * ;
r1 area . \ should print 6

CELL is the size of a word; we could say “4 field width” instead of “cell field width” on 32b machines. Here’s the definition of FIELD:

 : field ( struct-size field-size — new-struct-size )
    create over , +
    does> @ +
 ;

Again, pretty compact. The CREATE part stores the offset, a.k.a current struct size (OVER does ds.push(ds[1]), comma does *HERE++=ds.pop()), then adds the field size to the struct size, updating it for the next call to FIELD. The DOES> part fetches the offset, and adds it to the top of the stack, supposedly containing the object base pointer, so that “rect width” or “rect height” compute &rect.width or &rect.height, respectively. Then you can access this address with @ or ! (fetch/store). STRUCT simply pushes 0 to the top of the data stack (initial size value), and at the end, CONSTANT consumes the struct size:

struct \ data stack: 0
  cell ( ds: 0 4 ) field width  ( ds: 4 )
  cell ( ds: 4 4 ) field height ( ds: 8 )
constant rectangle ( ds: as before STRUCT )

You can further extend this to support polymorphic methods - METHOD would work similarly to FIELD, fetching a function pointer (”execution token”) through a vtable pointer and an offset kept in the CREATEd part. A basic object system in Forth can thus be implemented in one screen (a Forth code size unit - 16 lines x 64 characters).

To this day, I find it shocking that you can define defining words like CONSTANT, FIELD, CLASS, METHOD - something reserved to built-in keywords and syntactic conventions in most languages - and you can do it so compactly using such crude facilities so trivial to implement. Back when I first saw this, I didn’t know about DEFMACRO and how it could be used to implement the defining words of CLOS such as DEFCLASS and DEFMETHOD (another thing about Lisp they don’t teach in schools). So Forth was completely mind-blowing.

And then I put Forth aside.

It seemed more suited for number crunching/”systems programming” than text processing/”scripting”, whereas it is scripting that is the best trojan horse for pushing a language into an organization. Scripting is usually mission-critical without being acknowledged as such, and many scripts are small and standalone. Look how many popular “scripting languages” there are as opposed to “systems programming languages”. Then normalize it by the amount of corporate backing a language got on its way to popularity. Clearly scripting is the best trojan horse.

In short, there were few opportunities to play with Forth at work, so I didn’t. I fiddled with the interpreter and with the metaprogramming and then left it at that without doing any real programming.

Here’s what Jeff Fox, a prominent member of the Forth community who’ve worked with Chuck Moore for years, has to say about people like me:

Forth seems to mean programming applications to some and porting Forth or dissecting Forth to others. And these groups don’t seem to have much in common.

…One learns one set of things about frogs from studying them in their natural environment or by getting a doctorate in zoology and specializing in frogs. And people who spend an hour dissecting a dead frog in a pan of formaldehyde in a biology class learn something else about frogs.

…One of my favorite examples was that one notable colorforth [a Forth dialect] enthusiast who had spent years studying it, disassembling it, reassembling it and modifying it, and made a lot of public comments about it, but had never bothered running it and in two years of ’study’ had not been able to figure out how to do something in colorforth as simple as:

1 dup +

…[such Forth users] seem to have little interest in what it does, how it is used, or what people using it do with it. But some spend years doing an autopsy on dead code that they don’t even run.

Live frogs are just very different than dead frogs.

Ouch. Quite an assault not just on a fraction of a particular community, but on language geeks in general.

I guess I feel that I could say that if it isn’t solving a significant real problem in the real world it isn’t really Forth.

True, I guess, and equally true from the viewpoint of someone extensively using any non-mainstream language and claiming enormous productivity gains for experts. Especially true for the core (hard core?) of the Forth community, Forth being their only weapon. They actually live in Forth; it’s DIY taken to the extreme, something probably unparalleled in the history of computing, except, perhaps, the case of Lisp environments and Lisp machines (again).

Code running on Forth chips. Chips designed with Forth CAD tools. Tools developed in a Forth environment running on the bare metal of the desktop machine. No standard OS, file system or editor. All in recent years when absolutely nobody else would attempt anything like it. They claim to be 10x to 100x more productive than C programmers (a generic pejorative term for non-Forth programmers; Jeff Fox is careful to put “C” in quotes, presumably either to make the term more generic or more pejorative).

…people water down the Forth they do by not exercising most of the freedom it offers… by using Forth only as debugger or a yet another inefficient scripting language to be used 1% of the time.

Forth is about the freedom to change the language, the compiler, the OS or even the hardware design and is very different than programming languages that are about fitting things to a fixed language syntax in a narrow work context.

What can be said of this? If, in order to “really” enter a programming culture, I need to both “be solving a significant real problem in the real world” and exercising “the freedom to change the language, the compiler, the OS or even the hardware design”, then there are very few options for entering this culture indeed. The requirement for “real world work” is almost by definition incompatible with “the freedom to change the language, the compiler, the OS and the hardware design”.

And then it so happened that I started working on a real-world project about as close to Forth-level DIY as possible. It was our own hardware, with our own OS, our own compilers, designed to be running our own application. We did use standard CAD tools, desktop operating systems and editors, and had standard RISC cores in the chip and standard C++ cross compilers for them. Well, everyone has weaknesses. Still, the system was custom-tailored, embedded, optimized, standalone, with lots of freedom to exercise - pretty close to the Forth way, in one way.

One part of the system was an image processing co-processor, a variation on the VLIW theme. Its memory access and control flow was weird and limited, to the effect that you could neither access nor jump to an arbitrary memory address. It worked fine for the processing-intensive parts of our image processing programs.

We actually intended to glue those parts together with a few “control instructions” setting up the plentiful control registers of this machine. When I tried, it quickly turned out, as was to be expected, that those “control instructions” must be able to do, well, everything - arithmetic, conditions, loops. In short, we needed a CPU.

We thought about buying a CPU, but it was unclear how we could use an off-the-shelf product. We needed to dispatch VLIW instructions from the same instruction stream. We also needed a weird mixture of features. No caches, no interrupts, no need for more than 16 address bits, but for accessing 64 data bits, and 32-bit arithmetic.

We thought about making our own CPU. The person with the overall responsibility for the hardware design gently told me that I was out of my mind. CPUs have register files and pipeline and pipeline stalls and dependency detection to avoid those stalls and it’s too complicated.

And then I asked, how about a stack machine? No register file. Just a 3-stage pipeline - fetch, decode, execute. No problem with register dependencies, always pop inputs from the top of the stack, push the result.

He said it sounded easy enough alright, we could do that. “It’s just like my RPN calculator. How would you program it?” “In Forth!”

I defined the instruction set in a couple of hours. It mapped to Forth words as straightforwardly as possible, plus it had a few things Forth doesn’t have that C might need, as a kind of insurance (say, access to 16-bit values in memory).

This got approved and implemented; not that it became the schedule bottleneck, but it was harder than we thought. Presumably that was partly the result of not reading “Stack Computers: the new wave“, and not studying the chip designs of Forth’s creator Chuck Moore, either. I have a feeling that knowledgable people would have sneered at this machine: it was trivial to compile Forth to it, but at the cost of complicating the hardware.

But I was satisfied - I got a general-purpose CPU for setting up my config regs at various times through my programs, and as a side effect, I got a Forth target. And even if it wasn’t the most cost-effective Forth target imaginable, it was definitely a time to start using Forth at work.

(Another area of prior art on stack machines that I failed to study in depth was 4stack - an actual VLIW stack machine, with 4 data stacks as suggested by its name. I was very interested in it, especially during the time when we feared implementation problems with the multi-port register file feeding our multiple execution units. I didn’t quite figure out how programs would map to 4stack and what the efficiency drop would be when one had to spill stuff from the data stacks to other memory because of data flow complications. So we just went for a standard register file and it worked out.)

The first thing I did was write a Forth cross-compiler for the machine - a 700-line C++ file (and for reasons unknown, the slowest-compiling C++ code that I have ever seen).

I left out all of the metaprogramming stuff. For instance, none of the Forth examples above, the ones that drove me to Forth, could be made to work in my own Forth. No WORD, no COMPILE, no IMMEDIATE, no CREATE/DOES>, no nothing. Just colon definitions, RPN syntax, flow control words built into the compiler. “Optimizations” - trivial constant folding so that 1 2 + becomes 3, and inlining - :INLINE 1 + ; works just like : 1 + ; but is inlined into the code of the caller. (I was working on the bottlenecks so saving a CALL and a RETURN was a big deal.) So I had that, plus inline assembly for the VLIW instructions. Pretty basic.

I figured I didn’t need the more interesting metaprogramming stuff for my first prototype programs, and I could add it later if it turned out that I was wrong. It was wierd to throw away everything I originally liked the most, but I was all set to start writing real programs. Solving real problems in the real world.

It was among the most painful programming experiences in my life.

All kinds of attempts at libraries and small test programs aside, my biggest program was about 700 lines long (that’s 1 line of compiler code for 1 line of application code). Here’s a sample function:

: mean_std ( sum2 sum inv_len — mean std )
  \ precise_mean = sum * inv_len;
  tuck u* \ sum2 inv_len precise_mean
  \ mean = precise_mean >> FRAC;
  dup FRAC rshift -rot3 \ mean sum2 inv_len precise_mean
  \ var = (((unsigned long long)sum2 * inv_len) >> FRAC) - (precise_mean * precise_mean >> (FRAC*2));
  dup um* nip FRAC 2 * 32 - rshift -rot \ mean precise_mean^2 sum2 inv_len
  um* 32 FRAC - lshift swap FRAC rshift or \ mean precise_mean^2 sum*inv_len
  swap - isqrt \ mean std
;

Tuck u*.

This computes the mean and the standard deviation of a vector given the sum of its elements, the sum of their squares, and the inverse of its length. It uses scaled integer arithmetic: inv_len is an integer keeping (1<<FRAC)/length. How it arranges the data on the stack is beyond me. It was beyond me at the time when I wrote this function, as indicated by the plentiful comments documenting the stack state, amended by wimpy C-like comments (”C”-like comments) explaining the meaning of the postfix expressions.

This nip/tuck business in the code? Rather than a reference to the drama series on plastic surgery, these are names of Forth stack manipulation words. You can look them up in the standard. I forgot what they do, but it’s, like, ds.insert(2,ds.top()), ds.remove(1), this kind of thing.

Good Forth programmers reportedly don’t use much of those. Good Forth programmers arrange things so that they flow on the stack. Or they use local variables. My DRAW-RECTANGLE definition above, with a 2DUP, was reasonably flowing by my standards: you get width and height, duplicate both, and have all 4 data items - width,height,width,height - consumed by the next 4 words. Compact, efficient - little stack manipulation. Alternatively we could write:

: DRAW-RECTANGLE { width height }
  height UP
  width RIGHT
  height DOWN
  width LEFT
;

Less compact, but very readable - not really, if you think about it, since nobody knows how much stuff UP leaves on the stack and what share of that stuff RIGHT consumes, but readable enough if you assume the obvious. One reason not to use locals is that Chuck Moore hates them:

I remain adamant that local variables are not only useless, they are harmful.

If you are writing code that needs them you are writing non-optimal code. Don’t use local variables. Don’t come up with new syntax for describing them and new schemes for implementing them. You can make local variables very efficient especially if you have local registers to store them in, but don’t. It’s bad. It’s wrong.

It is necessary to have [global] variables. … I don’t see any use for [local] variables which are accessed instantaneously.

Another reason not to use locals is that it takes time to store and fetch them. If you have two items on a data stack on a hardware stack machine, + will add them in one cycle. If you use a local, then it will take a cycle to store its value with { local_name }, and a cycle to fetch its value every time you mention local_name. On the first version of our machine, it was worse as fetching took 2 cycles. So when I wrote my Forth code, I had to make it “flow” for it to be fast.

The abundance of DUP, SWAP, -ROT and -ROT3 in my code shows that making it flow wasn’t very easy. One problem is that every stack manipulation instruction also costs a cycle, so I started wondering whether I was already past the point where I had a net gain. The other problem was that I couldn’t quite follow this flow.

Another feature of good Forth code, which supposedly helps achieve the first good feature (”flow” on the stack), is factoring. Many small definitions.

Forth is highly factored code. I don’t know anything else to say except that Forth is definitions. If you have a lot of small definitions you are writing Forth. In order to write a lot of small definitions you have to have a stack.

In order to have really small definitions, you do need a stack, I guess - or some other implicit way of passing parameters around; if you do that explicitly, definitions get bigger, right? That’s how you can get somewhat Forth-y with Perl - passing things through the implicit variable $_: call chop without arguments, and you will have chopped $_.

Anyway, I tried many small definitions:

:inline num_rects params @ ;
:inline sum  3 lshift gray_sums + ;
:inline sum2 3 lshift gray_sums 4 + + ;
:inline rect_offset 4 lshift ;
:inline inv_area rect_offset rects 8 + + @ ;
:inline mean_std_stat ( lo hi — stat )
  FRAC lshift swap 32 FRAC - rshift or
;
: mean_std_loop
 \ inv_global_std = (1LL << 32) / MAX(global_std, 1);
 dup 1 max 1 swap u/mod-fx32 drop \ 32 frac bits

 num_rects \ start countdown
 begin
  1 - \ rects–
  dup sum2 @
  over sum @
  pick2 inv_area
  mean_std \ global_mean global_std inv_global_std rectind mean std
  rot dup { rectind } 2 NUM_STATS * * stats_arr OFT 2 * + + { stats }
  \ stats[OFT+0] = (short)( ((mean - global_mean) * inv_global_std) >> (32 - FRAC) );
  \ stats[OFT+1] = (short)( std * inv_global_std >> (32 - FRAC) );
  pick2       um* mean_std_stat stats 2 + h! \ global_mean global_std inv_global_std mean
  pick3 - over m* mean_std_stat stats h!
  rectind ?dup 0 = \ quit at rect 0
 until
 drop 2drop
;

I had a bunch of those short definitions, and yet I couldn’t get rid of heavy functions with DUP and OVER and PICK and “C” comments to make any sense of it. This stack business just wasn’t for me.

Stacks are not popular. It’s strange to me that they are not. There is just a lot of pressure from vested interests that don’t like stacks, they like registers.

But I actually had a vested interest in stacks, and I began to like registers more and more. The thing is, expression trees map perfectly to stacks: (a+b)*(c-d) becomes a b + c d - *. Expression graphs, however, start to get messy: (a+b)*a becomes a dup b + *, and this dup cluttering things up is a moderate example. And an “expression graph” simply means that you use something more than once. How come this clutters up my code? This is reuse. A kind of factoring, if you like. Isn’t factoring good?

In fact, now that I thought of it, I didn’t understand why stacks were so popular. Vested interests, perhaps? Why is the JVM bytecode and the .NET bytecode and even CPython’s bytecode all target stack VMs? Why not use registers the way LLVM does?

Speaking of which. I started to miss a C compiler. I downloaded LLVM. (7000 files plus a huge precompiled gcc binary. 1 hour to build from scratch. So?) I wrote a working back-end for the stack machine within a week. Generating horrible code. Someone else wrote an optimizing back-end in about two months.

After a while, the optimizing back-end’s code wasn’t any worse than my hand-coded Forth. Its job was somewhat easier than mine since by the time it arrived, it only took 1 cycle to load a local. On the other hand, loads were fast as long as they weren’t interleaved with stores - some pipeline thing. So the back-end was careful to reorder things so that huge sequences of loads went first and then huge sequences of stores. Would be a pity to have to do that manually in Forth.

You have no idea how much fun it is to just splatter named variables all over the place, use them in expressions in whatever order you want, and have the compiler schedule things. Although you do it all day. And that was pretty much the end of Forth on that machine; we wrote everything in C.

What does this say about Forth? Not much except that it isn’t for me. Take Prolog. I know few things more insulting than having to code in Prolog. Whereas Armstrong developed Erlang in Prolog and liked it much better than reimplementing Erlang in C for speed. I can’t imagine how this could be, but this is how it was. People are different.

Would a good Forth programmer do better than me? Yes, but not just at the level of writing the code differently. Rather, at the level of doing everything differently. Remember the freedom quote? “Forth is about the freedom to change the language, the compiler, the OS or even the hardware design”.

…And the freedom to change the problem.

Those computations I was doing? In Forth, they wouldn’t just write it differently. They wouldn’t implement them at all. In fact, we didn’t implement them after all, either. The algorithms which made it into production code were very different - in our case, more complicated. In the Forth case, they would have been less complicated. Much less.

Would less complicated algorithms work? I don’t know. Probably. Depends on the problem though. Depends on how you define “work”, too.

The tiny VLSI toolchain from the epigraph? I showed Chuck Moore’s description of that to an ASIC hacker. He said it was very simplistic - no way you could do with that what people are doing with standard tools.

But Chuck Moore isn’t doing that, under the assumption that you need not to. Look at the chips he’s making. 144-core, but the cores (nodes) are tiny - why would you want them big, if you feel that you can do anything with almost no resources? And they use 18-bit words. Presumably under the assumption that 18 bits is a good quantity, not too small, not too large. Then they write an application note about imlpementing the MD5 hash function:

MD5 presents a few problems for programming a Green Arrays device. For one thing it depends on modulo 32 bit addition and rotation. Green Arrays chips deal in 18 bit quantities. For another, MD5 is complicated enough that neither the code nor the set of constants required to implement the algorithm will fit into one or even two or three nodes of a Green Arrays computer.

Then they solve these problems by manually implementing 32b addition and splitting the code across nodes. But if MD5 weren’t a standard, you could implement your own hash function without going to all this trouble.

In his chip design tools, Chuck Moore naturally did not use the standard equations:

Chuck showed me the equations he was using for transistor models in OKAD and compared them to the SPICE equations that required solving several differential equations. He also showed how he scaled the values to simplify the calculation. It is pretty obvious that he has sped up the inner loop a hundred times by simplifying the calculation. He adds that his calculation is not only faster but more accurate than the standard SPICE equation. … He said, “I originally chose mV for internal units. But using 6400 mV = 4096 units replaces a divide with a shift and requires only 2 multiplies per transistor. … Even the multiplies are optimized to only step through as many bits of precision as needed.

This is Forth. Seriously. Forth is not the language. Forth the language captures nothing, it’s a moving target. Chuck Moore constantly tweaks the language and largely dismisses the ANS standard as rooted in the past and bloated. Forth is the approach to engineering aiming to produce as small, simple and optimal system as possible, by shaving off as many requirements of every imaginable kind as you can.

That’s why its metaprogramming is so amazingly compact. It’s similar to Lisp’s metaprogramming in much the same way bacterial genetic code is similar to that of humans - both reproduce. Humans also do many other things that bacteria can’t (…No compatibility. No files. No operating system). And have a ton of useless junk in their DNA, their bodies and their habitat.

Bacteria have no junk in their DNA. Junk slows down the copying of the DNA which creates a reproduction bottleneck so junk mutations can’t compete. If it can be eliminated, it should. Bacteria are small, simple, optimal systems, with as many requirements shaved off as possible. They won’t conquer space, but they’ll survive a nuclear war.

This stack business? Just a tiny aspect of the matter. You have complicated expression graphs? Why do you have complicated expression graphs? The reason Forth the language doesn’t have variables is because you can eliminate them, therefore they are junk, therefore you should eliminate them. What about those expressions in your Forth program? Junk, most likely. Delete!

I can’t do that.

I can’t face people and tell them that they have to use 18b words. In fact I take pride in the support for all the data types people are used to from C in our VLIW machine. You can add signed bytes, and unsigned shorts, and you even have instructions adding bytes to shorts. Why? Do I believe that people actually need all those combinations? Do I believe that they can’t force their 16b unsigned shorts into 15b signed shorts to save hardware the trouble?

OF COURSE NOT.

They just don’t want to. They want their 16 bits. They whine about their 16th bit. Why do they want 16 and not 18? Because they grew up on C. “C”. It’s completely ridiculous, but nevertheless, people are like that. And I’m not going to fight that, because I am not responsible for algorithms, other people are, and I want them happy, at least to a reasonable extent, and if they can be made happier at a reasonable cost, I gladly pay it. (I’m not saying you can’t market a machine with a limited data type support, just using this as an example of the kind of junk I’m willing to carry that in Forth it is not recommended to carry.)

Why pay this cost? Because I don’t do algorithms, other people do, so I have to trust them and respect their judgment to a large extent. Because you need superhuman abilities to work without layers. My minimal stack of layers is - problem, software, hardware. People working on the problem (algorithms, UI, whatever) can’t do software, not really. People doing software can’t do hardware, not really. And people doing hardware can’t do software, etc.

The Forth way of focusing on just the problem you need to solve seems to more or less require that the same person or a very tightly united group focus on all three of these things, and pick the right algorithms, the right computer architecture, the right language, the right word size, etc. I don’t know how to make this work.

My experience is, you try to compress the 3 absolutely necessary layers to 2, you get a disaster. Have your algorithms people talk directly to your hardware people, without going through software people, and you’ll get a disaster. Because neither understands software very well, and you’ll end up with an unusable machine. Something with elaborate computational capabilities that can’t be put together into anything meaningful. Because gluing it together, dispatching, that’s the software part.

So you need at least 3 teams, or people, or hats, that are to an extent ignorant about each other’s work. Even if you’re doing everything in-house, which, according to Jeff Fox, was essentially a precondition to “doing Forth”. So there’s another precondtion - having people being able to do what at least 3 people in their respective areas normally do, and concentrating on those 3 things at the same time. Doing the cross-layer global optimization.

It’s not how I work. I don’t have the brain nor the knowledge nor the love for prolonged meditation. I compensate with, um, people skills. I find out what people need, that is, what they think they need, and I negotiate, and I find reasonable compromises, and I include my narrow understanding of my part - software tools and such - into those compromises. This drags junk in. I live with that.

I wish I knew what to tell you that would lead you to write good Forth. I can demonstrate. I have demonstrated in the past, ad nauseam, applications where I can reduce the amount of code by 90% and in some cases 99%. It can be done, but in a case by case basis. The general principle still eludes me.

And I think he can, especially when compatibility isn’t a must. But not me.

I still find Forth amazing, and I’d gladly hack on it upon any opportunity. It still gives you the most bang for the buck - it implements the most functionality in the least space. So still a great fit for tiny targets, and unlikely to be surpassed. Both because it’s optimized so well and because the times when only bacteria survived in the amounts of RAM available are largely gone so there’s little competition.

As to using Forth as a source of ideas on programming and language design in general - not me. I find that those ideas grow out of an approach to problem solving that I could never apply.

If a tree falls in a forest, it kills Schrödinger’s cat

Schrödinger used to have this quantum cat which was alive and dead at the same time as long as nobody opened the box, and it was the very act of looking at the cat that made it either alive or dead. Now, I’m not sure about this quantum stuff, but if you ask me you’d always find a dead cat upon opening the box, killed by the act of not looking. In fact, if you open any random box nobody was looking into, chances are you’ll find a dead cat there. Let me give an example.

I recently chatted with a former employee of a late company I’ll call Excellence (the real name was even worse). Excellence was a company with offices right across the street that kept shrinking until the recent financial crisis. It then had to simultaneously fire almost all of its remaining employees, carefully selected as their best during the previous years when other employees were continuously fired at a lower rate. Giving us a whole lot of great hires, including MV, the co-worker in this story (though he was among those who guessed where things were going and crossed the street a bit earlier).

Rumors tell that to the extent possible, Excellence used to live up to expectations created by its name. In particular, without being encouraged or forced by customers to comply to any Software Development Methodology such as the mighty and dreadful CMM, they had (as CMM puts it) not only Established, but rigorously Maintained an elaborate design, documentation and review process which preceded every coding effort. Other than praise, MV had little to say about the process, except perhaps that occasionally someone would invent something awfully complicated that made code incomprehensible, having gone just fine through the review process because of sounding too smart for anyone to object.

Now, in our latest conversation about how things were at Excellence, MV told how he once had to debug a problem in a core module of theirs, to which no changes had been made in years. There, he stumbled upon a loop searching for a value. He noticed that when the value was found, the loop wouldn’t terminate - a continue instead of a break kind of thing. Since the right value tended to be found pretty early through the loop, and because it was at such a strategic place, test cases everyone was running took minutes instead of seconds to finish. Here’s a dead cat in a gold-plated box for ya, and one buried quite deeply.

My own professional evolution shaped my mind in such a way that it didn’t surprise me in the slightest that this slipped past the reviewer(s). What surprised me was how this slipped past the jittery bodybuilder. You see, we have this Integration Manager, one of his hobbies being bodybuilding (a detail unrelated, though not entirely, to his success in his work), and one thing he does after integrating changes is he looks at the frame rate. When the frame rate seems low, he pops up a window with the execution profile, where the program is split into about 1000 parts. If your part looks heavier than usual, or if it’s something new that looks heavy compared to the rest, it’ll set him off.

So I asked MV how come that the cat, long before it got dead and buried, didn’t set off the jittery bodybuilder. He said they didn’t have one for it to set off. They were translating between the formats of different programs. Not that performance didn’t matter - they worked on quite large data sets. But to the end user, automatic translation taking hours had about the same value as automatic translation taking seconds - the alternative was manual translation taking weeks or months. So they took large-scale performance implications of their decisions into account during design reviews. Then once the code was done and tested, it was done right, so if it took N cycles to run, it was because it took N cycles to do whatever it did right.

And really, what is the chance that the code does everything right according to a detailed spec it is tested against, but there’s a silly bug causing it to do it awfully slowly? If you ask me - the chance is very high, and more generally:

  • Though not looking at performance followed from a reasonable assessment of the situation,
  • Performance was bad, and bad enough to become an issue (though an unacknowledged one), when it wasn’t looked at,
  • Although the system in general was certainly “looked at”, apparently from more eyes and angles than “an average system”, but it didn’t help,
  • So either you have a jittery bodybuilder specifically and continuously eyeballing something, or that something sucks.

Of course you can save effort using jittery automated test programs. For example, we’ve been running a regression testing system for about a year. I recently decided to look at what’s running through it, beyond the stuff it reports as errors that ought to be fixed (in this system we try to avoid false positives to the point of tolerating some of the false negatives, so it doesn’t complain loudly about every possible problem). I found that:

  • It frequently ran out of disk space. It was OK for it to run out of disk space at times, but it did it way too often now. That’s because its way of finding out the free space on the various partitions was obsoleted by the move of the relevant directories to network-attached storage.
  • At some of the machines, it failed to get licenses to one of the compilers it needed - perhaps because the env vars were set to good values with most users but not all, perhaps because of a compiler upgrade it didn’t take into account. [It was OK for it to occasionally fail to get a license (those are scarce) - then it should have retried, and at the worst case report a license error. However, the compiler error messages it got were new to it, so it thought something just didn't compile. It then ignored the problem on otherwise good grounds.]
  • Its way of figuring out file names from module names failed for one module which was partially renamed recently (don’t ask). Through an elaborate path this resulted in tolerating false negatives it really shouldn’t.

And I’m not through with this thing yet, which to me exemplifies the sad truth that while you can have a cat looking at other cats to prevent them from dying, a human still has to look at that supervisor cat, or else it dies, leading to the eventual demise of the others.

If you don’t look at a program, it rots. If you open a box, there’s a dead cat in it. And if a tree falls in a forest and no one is around to hear it, it sucks.

Getting the call stack without a frame pointer

Everything I know about getting the current call stack of C or C++ programs, including ones compiled with -fomit-frame-pointer or an equivalent, with or without a debugger. Hardly entertaining except for those specifically wishing to lay their hands on call stacks.

We’ll start with trivia people sharing my unhealthy interest in call stacks probably know. There are two contexts where you might want to get a call stack:

  1. Look at the call stack in a debugger to see what the program is doing.
  2. Get a representation of the call stack inside the program itself. For example, a memory profiler might want to attach the call stack identifying the context where the allocation is made to each memory block to see who allocates the most.

Sometimes (2) can be implemented using (1) by running a debugger programmatically and asking it for the current call stack, and sometimes it can’t (too much communication overhead, or no debugger available - for example, a program stripped of debugging information and running on a standalone embedded board).

The straightforward way of getting the current call stack, used both by debuggers and by programs curious about their own stacks, relies on the frame pointer. The idea is that a machine register is reserved for keeping a pointer into the stack, called the frame pointer, and every function is compiled to do the following in its prologue:

  • Push the return address to the stack
  • Push the frame pointer to the stack
  • Save the address of the resulting two-pointer structure to the frame pointer register

This creates a linked list on the stack, with every node keeping a return address - this list is the call stack (and this is why debuggers show the points of return from function calls and not the points of call, a bit annoyingly for function calls spanning many source code lines - in fact we get the return stack and not the call stack). Here’s how you get this list from within a program:

struct stack_frame {
  struct stack_frame* next;
  void* ret;
};
int get_call_stack(void** retaddrs, int max_size) {
  /* x86/gcc-specific: this tells gcc that the fp
     variable should be an alias to the %ebp register
     which keeps the frame pointer */
  register struct stack_frame* fp asm("ebp");
  /* the rest just walks through the linked list */
  struct stack_frame* frame = fp;
  int i = 0;
  while(frame) {
    if(i < max_size) {
      retaddrs[i++] = frame->ret;
    }
    frame = frame->next;
  }
  return i;
}

The code for getting the list head pointer depends on the platform, the list structure itself is common to many machines and compilers. The return addresses may be converted to function names and source line numbers with addr2line -f or similar. When the program can’t access its own debug info during its execution as in the case of embedded devices, the translation from addresses to names will be a separate offline step.

The whole frame pointer business is fairly widely documented, with a bunch of source code available centered around getting the call stack that way. I think the GNU backtrace function and the Windows StackWalk64 function, which these days are probably a better alternative than code snippets like the one above, also use this linked list when available.

Now, what happens if the compiler is told to avoid generating the code maintaining the frame pointer (-fomit-frame-pointer for gcc, /Oy for MSVC), or not told to override its default behavior of not generating such code (-ga for Green Hills C++)?

Admittedly it doesn’t appear to be such a good idea to make debugging that much harder in order to save a few instructions. However, there are reasons to do so. For example, one common consequence of Software Design is lots of functions doing little or nothing except for delegating their work to another function. Without frame pointer maintenance, such a call is just a jump instruction - your callee function will return directly to the address saved by your caller. With frame pointers, you need a whole prologue and epilogue here. Anyway, we won’t discuss the benefits of frame pointer omission since measuring the overhead for your particular code will be more reliable than such a discussion anyway.

Compiling without frame pointers hits code trying to obtain its own context harder than it hits debuggers, because debuggers don’t really need frame pointers and only (sometimes) rely on them for simplicity of implementation. Given a return address, a debugger can tell:

  1. Which function it belongs to (unlike the program itself, a debugger is necessarily supposed to have access to the symbol table)
  2. Where that function keeps the return address (the compiler knows that, so it can tell the debugger)
  3. The amount by which the function decrements the stack pointer, assuming the stack grows downwards - again something the compiler knows. Now that the debugger knows the previous return address and the previous stack pointer, it can go back to step 1.

So a debugger can do just fine without frame pointers as long as the compiler gives it enough information about the layout of the stack. I’ve been debugging without frame pointers for a long time with the Green Hills MULTI debugger which uses a proprietary debug info format. More recently the DWARF format, gcc and gdb seem to have caught up and now programs compiled with -fasynchronous-unwind-tables -fomit-frame-pointer are debuggable with gdb. The information generated by -fasynchronous-unwind-tables seems to go to a separate ELF section called .eh_frame_hdr.

Not only will gdb use .eh_frame_hdr, but the GNU backtrace function appears to be using it as well (it doesn’t work under -fomit-frame-pointer but apparently does work when you add -fasynchronous-unwind-tables - although the docs explicitly say: “frame pointer elimination will stop backtrace from interpreting the stack contents correctly”). Nor is this section stripped from the program - it’s not implemented as a “normal” debug information section but as an allocated data section, so it’s always available to a program (in particular, to the backtrace function).

So under gcc, all call stack problems seem to be solved - unless you trust the docs (!?), or unless some code isn’t compiled with the right flags because of not being up to date or someone being too greedy to allocate space for a debug info section. Outside gcc, or more precisely DWARF, I don’t think a stripped program can access such debug info.

Is there a way to get a call stack without a frame pointer, without a debugger and without debug info?

For years I was sure that the answer was “no”, hence some things will only work under a separate build mode - just like the release build but with frame pointers. Then one time the Green Hills debugger failed to list the call stack for some reason as it sometimes does, but this time we really wanted to decipher it. And we figured that we can in fact do the same thing the debugger does, except we’ll understand from the assembly code what the debugger usually understood from debug information.

Specifically, to understand where the return address is kept and by what amount the stack pointer is decremented, you need to find the instructions doing (or undoing) these things in the prologue (or the epilogue) of the function. This worked, but due to either inertia or stupidity it took me months to realize that you can write code doing this. Anyway, here’s how it works on a 32b MIPS processor under the Green Hills compiler. The prologue code of a function will contain instructions like these:

main+0: 27bdffe8 addiu sp, sp, -0x18
main+4: afbf0014 sw    r31, 0x14(sp)

The add immediate instruction decrements the stack pointer, and the store word instruction saves the return address from the register where it’s saved by the caller to some place on the stack. The high 16 bits of these instructions don’t depend on the function, encoding the “addui sp, sp” and the “sw, r31 …(sp)” parts. The low 16 bits encode a signed offset. So we can obtain the call stack from our code disassembling it thusly:

/* get previous stack pointer and return address
   given the current ones */
int get_prev_sp_ra(void** prev_sp, void** prev_ra,
                   void* sp, void* ra) {
  unsigned* wra = (unsigned*)ra;
  int spofft;
  /* scan towards the beginning of the function -
     addui sp,sp,spofft should be the first command */
  while((*wra >> 16) != 0x27bd) {
    /* test for "scanned too much" elided */
    wra--;
  }
  spofft = ((int)*wra << 16) >> 16; /* sign-extend */
  *prev_sp = (char*)sp - spofft;
  /* now scan forward for sw r31,raofft(sp) */
  while(wra < (unsigned*)ra) {
    if((*wra >> 16) == 0xafbf) {
      int raofft = ((int)*wra << 16) >> 16; /* sign */
      *prev_ra = *(void**)((char*)sp + raofft);
      return 1;
    }
    wra++;
  }
  return 0; /* failed to find where ra is saved */
}

The call stack will then be produced by the following loop:

int get_call_stack_no_fp(void** retaddrs, int max_size) {
  void* sp = get_sp(); /* stack pointer register */
  void* ra = get_ra(); /* return address register */
  /* adjust sp by the offset by which this function
     has just decremented it */
  int* funcbase = (int*)(int)&get_call_stack_no_fp;
  /* funcbase points to an addiu sp,sp,spofft command */
  int spofft = (*funcbase << 16) >> 16; /* 16 LSBs */
  int i=0;
  sp = (char*)sp-spofft;
  do {
    if(i < max_size) {
      retaddrs[i++] = ra;
    }
  }
  while(get_prev_sp_ra(&sp,&ra,sp,ra));
  return i; /* stack size */
}

get_sp and get_ra access registers so they must be assembly macros, which in the case of Green Hills can be spelled like this:

asm void* get_ra() {
  move $v0, $ra
}
asm void* get_sp() {
  move $v0, $sp
}

Under MIPS32 and Green Hills, this code seems to be giving decent call stacks except for the inevitable omission of function calls done without saving the return address; the most common case of those - simple delegating functions - was already mentioned above. If f calls g which does (almost) nothing except for calling h, and g doesn’t bother to save the return address to the stack, having h return directly to f, then the call stack will contain f and h but not g. Not much of a problem and sometimes even an advantage as far as I’m concerned, since g is rarely interesting. Also, you can get this problem irregardless of the lack of a frame pointer - for example, gcc -O3 on x86 will not maintain the call stack accurately even without -fomit-frame-pointer, generating the following ridiculous code:

pushl   %ebp
movl    %esp, %ebp
popl    %ebp
; so what was the point of setting up and
; immediately destroying a stack frame?
jmp     print_trace

Now, although code like get_prev_sp_ra looking for 0×27bd admittedly is a heuristic relying on undocumented platform-specific behavior, it looks like a passable way of getting a call stack on a RISC machine like MIPS, ARM or PowerPC. What about the x86 though? Effectively we have our code partially disassembling itself here, which is not nearly as easy with the x86 (in particular, I don’t think there’s a way to scan backwards because of the variable encoding length; although we could look at epilogues instead of prologues just as well).

Instead of dragging in a disassembler, we can use an external program, such as, well, a debugger. This obviously defeats the purpose of being able to get the call stack without a debugger. But this purpose isn’t very interesting on the x86 in the first place because there you’re rarely stuck in situations where a program can’t run a debugger.

The only point of the disassembling business on the x86 thus remains to deal with programs compiled without a frame pointer and without debug information making it possible to get the call stack nonetheless. I don’t know if anybody has such a problem these days, now that gcc has -fasynchronous-unwind-tables - perhaps someone uses compilers which can’t do this or binaries compiled without this, and perhaps the problem is extinct on the x86. For what it’s worth, here’s a script getting the call stack from a core file without relying on gdb’s bt command but relying on its disassemble command. Usage: python bt <program> <core>. No warranty, …or FITNESS FOR A PARTICULAR PURPOSE.

And this is all I know about getting the call stack in C or C++, something users of other languages can do, in the (unlikely) absence of a library function doing just that, simply by throwing an exception, immediately catching it and using its getStackTrace method or some such.

What makes cover-up preferable to error handling

There was a Forth tutorial which I now fail to find that literally had a “crash course” right in the beginning, where you were shown how to crash a Forth interpreter. Not much of a challenge - `echo 0 @ | pforth` does the trick for me - but I liked the way of presentation: “now we’ve learned how to crash, no need to return to that in the future”.

So, let’s have a Python & Perl crash course - do something illegal and see what happens. We’ll start with my favorite felony - out-of-bounds array access:

python -c 'a=(1,2,3); print "5th:",a[5]‘
perl -e ‘@a=(1,2,3); print “5th: $a[5]\n”‘

The output:

5th:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
IndexError: tuple index out of range
5th:

Python in fact crashed, telling us it didn’t like the index.

Perl was more kind and rewarded our out-of-bounds index with what looks like the empty string. Being the kind of evildoer who’s only further provoked by the gentle reactions of a do-gooder, what shall we do to further harass it? Well, it looks like anything makes a good index (and I mean anything: if @a=(11,22), then $a["1"]==22 and $a["xxx"]==11). But perhaps some things don’t make good arrays.

python -c 'x=5; print "2nd:",x[2]‘
perl -e ‘$x=5; print “2nd: $x[2]\n”‘

Output:

2nd:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: 'int' object is unsubscriptable
2nd:

Python gives us its familiar elaborate complains, while Perl gives us its familiar laconic empty string. Its kindness and flexibility are such that in a numeric context, it would helpfully give us the number 0 - the empty string is what we get in a string context.

Is there any way to exceed the limits of Perl’s patience? What about nonsensical operations - I dunno, concatenating a hash/dictionary/map/whatever you call it and a string?

python -c 'map={1:2,3:4}; print "map+abc:",map+"abc"'
perl -e '%map=(1,2,3,4); print "map+abc: ",%map . "abc\n"'

Output:

map+abc:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'dict' and 'str'
map+abc: 2/8abc

Python doesn’t like our operands but Perl retains its non-judgmental frame of mind. A Perl detractor could point out that silently converting %map to “2/8″ (hash size/reserved space) is patently insane. A Perl aficionado could point out that Perl seems to be following Python’s motto “Explicit is better than implicit” better than Python itself. In Python you can’t tell the type of map at the point of usage. Perl code clearly states it’s a hash with %, moreover . specifically means string concatenation (as opposed to +). So arguably you get what you asked for. Well, the one thing that is not debatable is that we still can’t crash Perl.

OK, so Perl is happy with indexes which aren’t and it is happy with arrays which aren’t, and generally with variables of some type which aren’t. What about variables that simply aren’t?

python -c 'print "y:",y'
perl -e 'print "y: $y\n"'

Output:

y:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
NameError: name 'y' is not defined
y:

NameError vs that hallmark of tolerance, the empty string, a helpful default value for a variable never defined.

By the way, this is how $x[5] evaluates to an empty string when x isn’t an array, I think. $x[5] is unrelated to the scalar variable $x, it looks for the array variable @x in another namespace. There’s no @x so you get an empty array, having no 5th element so you get “”. I think I understand it all, except for one thing: is there any way at all to disturb the divine serenity of this particular programming language?

python -c 'a=0; print 1/a'
perl -e '$a=0; print 1/$a'

This finally manages to produce:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero
Illegal division by zero at -e line 1.

The second message about “illegal” division by zero (is there any other kind?) comes from our no longer tolerant friend, making me wonder. What is so special about division by zero? Why not be consistent with one’s generally calm demeanor and return something useful like “” or 0? Would be perfectly reasonable - I did it myself, more accurately asked to have a hardware divider returning zero in these cases. Because there wasn’t what hardware calls exception handling (having the processor jump to a handler in the middle of an instruction stream). We lived happily ever after, so what’s wrong with 0?

But the real question is, what explains the stunning difference between Perl’s and Python’s character? Is it philosophical, “There’s More Than One Way To Do It (TMTOWTDI)” vs “There should be one - and preferably only one - obvious way to do it; Although that way may not be obvious at first unless you’re Dutch” (actual Perl and Python mottos, respectively)? The latter approach encourages to classify program behavior as “erroneous” where the former tends to instead assume you’re knowingly doing something clever in Yet Another Possible Way, right?

Modernism vs Postmodernism, maybe, as outlined by Perl’s author in “Perl, the first postmodern computer language“? “Perl is like the perfect butler. Whatever you ask Perl to do, it says `Very good, sir,’ or `Very good, madam.’ … Contrast that with the Modern idea of how a computer should behave. It’s really rather patronizing: `I’m sorry Dave. I can’t allow you to do that.’” The latter can be illustrated by Python’s way of answering the wish of its many users to use braces rather than indentation for scoping:

>>> from __future__ import braces
SyntaxError: not a chance

So, “Very good, sir” vs “I can’t allow you to do that”. Makes sense with Python vs Perl, but what about, say, Lisp vs C++?

Lisp definitely has a “There’s More Than One Way To Do It” motive in its culture. Look how much control flow facilities it has compared to Python - and on top of that people write flow macros, and generally “if you don’t like the way built-in stuff works, you can fix it with macros”, you know the drill. Guessing the user’s intent in ambiguous situations? Perl says that it “uses the DWIM (that’s “Do What I Mean”) principle” for parsing, borrowing a term from Lisp environments. And yet:

(print (let ((x (make-array 3))) (aref x 5)))
*** - AREF: index 5 for #(NIL NIL NIL) is out of range
(print (let ((x 5)) (aref x 2)))
*** - AREF: argument 5 is not an array
(print (let ((m (make-hash-table))) (concatenate 'string m "def")))
*** - CONCATENATE: #S(HASH-TABLE :TEST FASTHASH-EQL) is not a SEQUENCE
(print y)
*** - EVAL: variable Y has no value
(print (/ 1 0))
*** - division by zero

5 out of 5, just like Python. Contrast that with C++ which definitely has a Bondage and Discipline culture, what with all the lengthy compiler error messages. Actually C++ would score 4 out of 5 on this test, but the test is a poor fit for statically typed languages. A more appropriate way to evaluate C++’s error handling approach would be to focus on errors only detectable at run time. The following message from The UNIX-HATERS Handbook records the reaction of a lisper upon his encounter with this approach:

Date: Mon, 8 Apr 91 11:29:56 PDT
From: Daniel Weise
To: UNIX-HATERS
Subject: From their cradle to our grave.

One reason why Unix programs are so fragile and unrobust is that C
coders are trained from infancy to make them that way. For example,
one of the first complete programs in Stroustrup’s C++ book (the
one after the “hello world” program, which, by the way, compiles
into a 300K image), is a program that performs inch-to-centimeter
and centimeter-to-inch conversion. The user indicates the unit of the
input by appending “i” for inches and “c” for centimeters. Here is
the outline of the program, written in true Unix and C style:

#include <stream.h>

main() {
  [declarations]
  cin >> x >> ch;
    ;; A design abortion.
    ;; This reads x, then reads ch.
  if (ch == ‘i’) [handle "i" case]
  else if (ch == ‘c’) [handle "c" case]
  else in = cm = 0;
    ;; That’s right, don’t report an error.
    ;; Just do something arbitrary.
[perform conversion] }

Thirteen pages later (page 31), an example is given that implements
arrays with indexes that range from n to m, instead of the usual 0 to
m. If the programmer gives an invalid index, the program just
blithely returns the first element of the array. Unix brain death forever!

You could say that the sarcasm in the Lisp-style comments proudly intermixed with C++ code is uncalled for since example programs are just that - example programs. As to the dreaded out-of-bound array access cited in the second example, well, C++ doesn’t handle that to avoid run time overhead.

But the cited program didn’t just ignore the range problem the way C would - it went to the trouble of checking the index and then helpfully returned the 0th element the way Perl would. Probably as one part of its illustration how in C++ you could have custom array types which aren’t like C arrays. But why Postmodern Perl arrays, in a generally Disciplined language?

Well, it was 1991 and C++ exceptions were very young, likely younger than the cited example programs. (They’ve since aged but didn’t improve to the point where most library users would be happy to have to catch them, hence many library writers aren’t throwing them.)

Likewise, Perl had all the features used in the examples above before it had exceptions - or more accurately before it had them under the spotlight, if I understand this correctly. (In Perl you handle exceptions by putting code in an eval { … } block and then calls to the die() function jump to the end of that block, saving die’s argument to $@ - instead of, well, dying. I think Perl had this relatively early; however it seems to only have become idiomatic after Perl 5’s OO support and an option to send exception objects to $@, with people using die strictly for dying prior to that.) Perhaps Perl’s helpful interpretations of obvious nonsense like $a["xxx"] aren’t that helpful after all, but what would you rather have it do - die()?

AFAIK Python had exceptions under the spotlight from the beginning - although similarly to Perl it had exception strings before it had exception classes. And in fact it does its best to adhere to its “Errors should never pass silently” philosophy, the few deviations coming to mind having to do with Boolean contexts - the falsehood of empty strings, None and 0 together with None!=False/0==False/1==True/2!=True and similar gateways to pornographic programming.

Lisp has conditions and restarts which blow exceptions out of the water, hence its willingness to report errors isn’t surprising. However, it gained these features in the 80s; what did previous dialects do? ERRORSET, which is similar to a try/catch block, appears to predate the current error handling system, but it doesn’t seem to have been there from the very beginning, either. I’m not familiar with the Lisp fossil record, but there’s a function for indexing lists called NTH which returns NIL given an out-of-bounds index. Lists definitely predate arrays, so I assume NTH likewise predates AREF which complains given a bad index. Perhaps NTH doesn’t complain about bad indexes since it also predates ERRORSET and any other form of exception handling?

The pattern seems to be: if the language has exceptions, most of its features and libraries handle errors. If it doesn’t, they don’t; errors are covered up.

(Although I won’t be surprised if I’m wrong about the Lisp history part because Lisp is generally much more thoughtful than I am  - just look at all the trouble Interlisp, by now an ancient dialect, went into in order to figure out whether a user wants to get an opportunity to fix an error manually or would rather have the program silently return to the topmost ERRORSET.)

awk and early PHP lack exceptions and are happy with out-of-bound array access. Java and Ruby have exceptions and you’ll get one upon such access. It isn’t just the culture. Or is it? Perl is PHP’s father and awk is PHP’s grandfather. *sh and make, which, like Perl, produce an empty string from $nosuchvar, aren’t good examples, either - sh is Perl’s mother and make is Perl’s sister. Is it really the Unix lineage that is at fault as suggested by a dated message to a late mailing list?

Here’s JavaScript, an offspring of Lisp:

>>> [1,2,3][5]+5
NaN
>>> [1,2,3][5]+”abc”
“undefinedabc”

I think this definitely rivals Perl. Apparently it’s not the lineage that is the problem - and JavaScript didn’t have exceptions during its first 2 years.

The thing is, errors are exceedingly gnarly to handle without exceptions. Unless you know what to do at the point where the error is detected, and you almost never know what to do at the point where the error is detected, you need to propagate a description of the error up the call chain. The higher a function sits up the call chain, the more kinds of errors it will have to propagate upwards.

(With a C++ background it can look like the big problem doesn’t come from function calls but from operators and expressions which syntactically have nowhere to send their complaints. But a dynamic language could have those expressions evaluate to a special Error object just as easily as it can produce “” or “undefined”. What this wouldn’t solve is the need to clutter control flow somewhere down the road when deciding which control path to take or what should go to files or windows, now that every variable can have the value Error tainting all computations in its path.)

Different errors carry different meta-data with them - propagating error codes alone ought to be punishable by death. What good is a “no such file” error if I don’t know which file it is? What good is a “network layer malfunction” error if I can’t tell that in fact it’s a “no such file” error at the network layer because a higher level lumps all error codes from a lower level into one code? (It always lumps them together since the layers have different lifecycles and the higher level can’t be bothered to update its error code list every time the lower level does.)

Different meta-data means, importantly for static languages, different types of output objects from callee functions depending on the error, which can only be handled in an ugly fashion. But even if there’s no such problem, you have to test every function call for errors. If a function didn’t originally return error information but now does, you have to change all points of call.

Nobody ever does it.

It is to me an upper bound that apparently all programming languages without exception handling cover up errors at the language level.

A designer of a successful programming language, whatever you think of that language, is definitely an above average programmer, actually I think one that can safely be assumed to occupy the top tenth no matter how you rate programming ability (and really, how do you?). Moreover, the language designer is also in the top tenth if programmers are sorted by the extent of importance they assign to their work and motivation to get things right - because the problems are fundamental, because of the ego trip, because of everything.

And still, once decision or chance lead them into a situation where the language has no exceptions, they allow themselves to have errors covered up throughout their cherished system. Despite the obvious pain it causes them, as evident from the need for rationalizations ranging from runtime overhead (as if you couldn’t have an explicit unsafe subset for that - see C#) to the Postmodern Butler argument (”Fuck you plenty? Very good, sir!”).

What does this leave us to expect from average programs? The ones written in the order of execution, top to bottom, with the same intellectual effort that goes into driving from point A to point B? The ones not exposed to the kind of massive testing/review that could exterminate bugs in a flophouse?

This is why nothing inspires my trust in a program like a tendency to occasionally wet its pants with an error call stack. Leaving some exceptions uncaught proves the programmers are somewhat sloppy but I wouldn’t guess otherwise anyway. What’s more important is that because of exceptions along the road, the program most likely won’t make it to the code where shooting me in the foot is implemented. And if a program never spills its guts this way, especially if it’s the kind of program with lots of I/O going on, I’m certain that under the hood, it’s busy silencing desperate screams: where there are no exceptions, there must be cover-up.