Side effects or not, aliasing kills you

My programming personality is very depressed right now. For a while, I've been away from my usual long-term stuff and back in the trenches. I'm doing simple things, but there's lots of them, so it feels like I'm not moving anywhere even though I go as fast as I can. It's like a highway with monotonic view.

Well, this has completely drained my brain. So I won't tell you anything in this blog - instead, I'll ask you a question. This way, it's OK for me to be wrong, which is even more likely than usual in my current state. Clever trick, isn't it?

My question basically is, how is not having side effects in your programming language going to help you write correct, and optimal, and parallel programs? I've already mentioned it, but this time, I'll elaborate on the point that I fail to get.

The problem with parallelism is scheduling. We want scheduling to be:

  1. Correct: when A depends on B, B should happen before A.
  2. Optimal: A, B and everything else should complete as soon as possible.

So, in order to correctly schedule basic blocks of code called A and B, I need to know if they depend on each other. If I have side effects, this is going to be hard. A modifies an object called X, and B modifies Y. Can X and Y be the same thing? To figure it out, you need whole-program analysis, and if your language is Turing-complete, WPA won't necessarily succeed, either.

This is the aliasing problem. The aliasing problem rules. It's the problem that forces you to use restrict to optimize numeric code. It's the problem that forces you to use static typing if you want to optimize anything at all. And it's the problem that makes all safe languages completely and wildly unsafe when you have multiple threads. The rule of thumb is: if you have a good idea for compile-time or run-time anything, it won't work because of the aliasing problem.

Now, suppose we got rid of those pesky side effects. A doesn't modify X; instead, it creates a new object, Z. Similarly, B takes Y and makes W. Now we can prove that A doesn't depend on B. As soon as their inputs are ready, they can run. This takes care of the correctness problem, which was listed as the scheduling problem #1 at the beginning.

However, this didn't solve the aliasing problem - X and Y can still be the same. I suspect that the aliasing problem is going to raise its ugly head and kick our pretty butt. Specifically, while semantically we made Z from X, we'd really like to wipe out X and reuse its memory for hosting Z. Because that would be more efficient than copying X and modifying some of it and keeping two objects around. Now, we can't do this optimization if X and Y are the same thing, and we know nothing about the execution order of A and B, can we? If A replaced X with Z, B would get the wrong Y.

And if we copy things, we aren't going to get an optimal scheduling, because A and B now run slower - problem #2, right?

Seriously, what do I miss? I realize that there are parallel execution environments where you have to copy things anyway because you have a zillion processors and a zillion processors can't have shared memory. I also realize that you can have few objects shared between tasks that run in parallel, so the overhead of copying them isn't very important. I realize all kinds of things making the aliasing problem a non-problem for side-effect-free parallelization frameworks or languages or whatever.

What I fail to see is, if you run in a single-box, multi-core system, and you do have shared memory, and you do have lots of shared data, how does removing side effects help you? With side effects, you can get an efficient program, if you can handle the debugging. Without side effects, you get a correct program, and then you must somehow optimize object allocation. And I just don't see how you'd go about that without compromising correctness.

Maybe I'm ignorant. Maybe I should have read about this somewhere. But where do I find a levelheaded discussion of, say, common implementation techniques of pure functional language compilers? I just keep bumping into "sufficiently smart compiler" arguments, and then there are people who don't care much about performance (and shouldn't because of their application domain).

Compare this to imperative languages. Anybody can write a pretty decent C compiler. If you don't optimize at all, you get what, 4x slowdown compared to the state of the art optimizer? And on a scalar RISC machine, I bet you can bridge 50%-70% of that gap by your cheapest shot at register allocation. A no-brainer, you need about 30% of a comp sci degree and you're all set. Now, how am I supposed to optimize allocation if there are no side effects? I mean, your spec or your family of specs should come with implementation techniques. "Useful" is good by itself, but it must also be implementable.

By the way, I'd love to find out that I'm wrong on this one. I've done lots of parallelism-related debugging, and it sucks. And maybe I am wrong. Maybe there just isn't much talk about the implementation details of these things. I do think my reasoning makes sense though.

16 comments ↓

#1 Doug on 03.21.08 at 6:54 pm

What language are you talking about? C?

Using immutable (side-effect free) aliased data in a language with automatic garbage collection is quite straightforward. As long as any thread holds a reference (pointer) to the data, it stays allocated. When no threads hold references to the data, the memory can be reused.

Also, as far as making modified "copies", it's not necessary to actually copy. You can also use a "wrapper" object that dynamically applies the change(s) whenever the underlying data is accessed.

All of the "pure functional languages" that I know of use immutable (side-effect free) data with automatic garbage collection. It's also usually pretty easy to create those wrapper objects.

If automatic garbage collection is unacceptable for your application, you might want to look at the ADA language. It specifically restricts how aliasing can be used.

In the end, you can't have everything. Something will be compromised, whether it's efficiency, debugability, simplicity, reliability, or whatever. If there was a way that we could somehow "have it all", we would simply do things that way all of the time.

#2 Yossi Kreinin on 03.22.08 at 1:47 am

I was talking about a hypothetic language that was supposed to be efficient and guarantee correct parallelization, something I'd need to write a compiler for.

Suppose you can have automatic garbage collection. Now you enter A which makes Z from X. In order to create Z where X used to be, you have to know whether X is still used *right now*. Since you can't look at a ref count, which would break in presence of cyclic references, and garbage collection only runs periodically, you are likely to lack the "confidence" needed to dispose of X when you need to do that.

As to wrapper objects, that seems to add a level of memory indirection, which can be nasty performance-wise.

Intuitively, I think about converting programs with side effects into programs without them. Imperative programs frequently create large objects and then modify them; those modifications would look like copying most of the object if translated straightforwardly to a functional style. It could be that "native" functional programs look very different and so the real optimization problems are also different.

Regarding "have it all": I couldn't agree more :)

#3 cgibbard on 03.22.08 at 4:26 pm

Well, functional programs really do look rather different from typical imperative programs. There are special techniques for getting performance from immutable datastructures where you can be certain that once computed, the data will remain the same forever. These mostly have to do with sharing parts of datastructures which are really the same. For instance, you might have a set which is represented by some balanced binary tree structure. As opposed to making a modified wholesale copy of that tree, the function which, say, adds an element to that set will only generate log(n) new tree nodes — the path from the new node up to the root. All the other subtrees will simply be shared. In the case of mutable datastructures, this would be a bad policy, since modifications to the original set would then be strangely and perhaps incoherently reflected in the new one. It's the immutability which allows you to do it.

In general, in pure value-oriented languages, when you make a change to just some small part of a large datastructure, you generally only need to copy a small part of it (ideally O(log(n)) assuming that you've selected the right datastructure for your access pattern), and everything else just gets referred to.

There are also techniques for ensuring that datastructures are not constructed at all, by fusing the code which produces the datastructure with the code that consumes it, generally referred to as deforestation. If you have one producer and multiple consumers, there are of course limits on what you can do. Uniqueness typing is a way to get the compiler to enforce that a value is only ever used once, so that after that happens, the space used for it can be reclaimed and used for something else. Even without that though, the techniques for dealing with many short-lived objects are getting quite good for most purposes. This is typically to the extent that we hardly worry about it in most cases. You don't really think in terms of one thing getting mutated over and over as much as just what the value you're interested in is, and what other values might be useful in order to construct it. It's only once you have an explicit problem with space, and profiling turns up that you need to be careful with the lifetime of values that you start to worry about retention and making sure that things can be reclaimed earlier.

#4 Doug on 03.22.08 at 8:50 pm

You don't need to know whether X is still used *right now*. All you need to know is that the memory being used for your new object is no longer in use for anything else. How long ago it quit being used for anything else is unimportant. We don't need to "dispose of X" the instant that it is no longer in use - disposal can wait.

Most automatic garbage collection systems keep a pool of unused memory. New objects are created from that pool. The garbage collection system regularly sweeps through the data and finds unused objects and adds the associated memory to the pool. An unused object might sit around for a long time before the garbage collector realizes that the object is no longer being used. Then the reclaimed memory might sit around in the pool a long time before it is actually reused. No big deal.

It sounds to me like you're worrying about inefficiencies that are inconsequential for most programming tasks. "A level of memory indirection" is measured in the nanoseconds these days, and hardly qualifies as "nasty performance-wise" in the overwhelming majority of software domains.

For the few domains where this kind of nanosecond nitpicking is necessary, functional programming would never do. Today's multi-threaded language of choice for those domains is probably ADA.

In addition to ADA, you might want to look at Occam for ideas.

#5 cgibbard on 03.22.08 at 9:50 pm

Interestingly enough, when nanoseconds *do* matter, a very nice tactic is to use a functional language or some other high level language to construct assembly or other low-level code. Such programs can do lots of work to try different implementations of the task you're interested in, which someone coding directly in assembly would never be able to look through, and since you have domain specific knowledge of the problem, you can make use of algebraic properties which ordinary compilers don't know about.

A good example of this is FFTW (the Fastest Fourier Transform in the West), which is an O'Caml program that generates C programs to compute discrete Fourier transforms.

A more general approach to this sort of thing is being worked on in the Coconut project at McMaster University. (You can watch a Google video about that here: http://uk.youtube.com/watch?v=yHd0u6zuWdw)

They're designing a very high level Haskell DSL for automatically constructing efficient SIMD and SMP code for signal processing tasks. On some tasks they're doing 4x better than the best hand-optimised programs written in C with inline assembly, because their tool can search a broad space of possible implementations and make use of algebraic properties of the problem at hand while doing it.

#6 orib on 03.23.08 at 4:48 am

You can do away with pointers. APL, K and J are functional vector languages that do not allow pointers, thus making reference count authoritative with respect to safety of changing memory (just one reference means it's ok to change). I think Tcl does the same but I'm not sure.

Of course, this only happens at the language level — if you manufacture pointers yourself, e.g. by using array coordinates of a global variable indirectly, you're still in the same place. However, being (mostly) functional, 99% of the code written does not do any such thing, it is also parallelizable.

If you're not familiar with J, K or APL, you should — you'll probably like them. They do away with a lot of the unneeded cruft we've come to accept as necessary evil when programming. There's a steep learning curve, and the syntax is not newbie friendly, but the reward is programs that are 10-100 times shorter than their C equivalent (usually closer to the 100 part of the scale), while usually — at least K — comparable in speed.

APL was invented as an "algorithm notation language" for paper back in the fifties, and then implemented in the 60's when someone noticed that it's well defined enough to execute. J and K are two descendants, the former focusing on completeness and mathematical elegance (IMHO, failed), and the later focusing on speed, terseness and with a bias towards financial applications.

#7 Yossi Kreinin on 03.24.08 at 1:17 pm

To Doug: I am worried about the inefficiency of the delayed reclaiming and all the "right now" business because I'm anal-retentive about performance - not in general, but in the specific context of real time embedded machine vision apps, which is the context where this entry is supposed to belong. I realize that your typical functional programming application domain is different. Thanks for the pointer on aliasing restriction in Ada - I'll have a look.

To cgibbard: Thanks for the link; I love code generation. I do want to somehow eliminate side-effect/parallelism problems at the target machine though; it's a separate story.

To orib: I think that vector languages like APL are the way to go if you do data parallelism, but I'm not sure about task parallelism, which is what I have in mind. I'd like to look at K nonetheless; the 100x part sounds too good to be true…

#8 orib on 03.24.08 at 5:11 pm

I know it sounds too good to be true. And to get there, it does take dropping all the useless things we studied in the last 20 years or so of programming.

qn:{[n],/{:[n=#*x;,*x;,/_f'f x,\:/:(!n)_dvl,/x]}'(f:0 -1 1+/:)@!n}
bd:{[p]`0:".Q"p=\:!#p}

Yep, I know it looks unreadable, but it is not obfuscated, just very terse (did I mention the learning curve is steep?)

The first line "qn: …" defines a function of one variable, n, that solves the n-queen problem. The second line "bd:…" defines a function that prints a solution produced by the first line. http://nsl.com/papers/qn.htm for more.

A sudoku solver:
p,:3/:_(p:9\:!81)%3
s:{*(,x)(,/{@[x;y;:;]'&21=x[&|/p[;y]=p]?!10}')/&~x}

That's actually idiomatic, NOT obfuscated. You'll have to trust me on this.

A function to (efficiently!) compute maximum running sum inside a list of numbers is:
|/0(0|+)\

(that is: find the maximal sum between indices i and j for all possible indices j>i; This would be the whole vector if all numbers are nonnegative, but not if there are negative numbers).

A function to compute an average of a vector is {(+/x)%#x} read "sum over x divided by count of x". It is so straightforward that you see it inline even though it's in the standard library (as "avg" — the name "average" is as long as the function definition and lacks the semantics that the function itself carries)

A scalar product function is written {+/x*y} or just +/* and to turn it into a a matrix product, {+/*\:/:[x;+y]}

A small database with a concept of order (more than relational), aggregation, etc implemented in 14 short lines can be found in http://nsl.com/k/t.k

K takes the APL approach to a radical place: Use ~60 basic functions, most of which aren't trivial, but which are mostly orthogonal and well chosen — apply them to very basic data types that include various atoms and vectors — and you find out that you don't need as much abstraction as all languages strive to present.

Now, having an APL heritage, those 60 functions are mapped onto ascii symbols, with each symbol getting one function as an unary op and another as a binary op (e.g., binary * is multiply, unary * is "first" — rhymes with pointer use in C). But there's a syntactic sugar layer on K (called "q") which makes everything into English, making "#&~~':" into the more readable "count where not eachprev match"

Being so terse, it's possible for the eyes to learn to pick patterns (like "average" above) instantaneously, and scan the whole database implementation on the screen at once.

Now, the other side is that APL and K are actually functional languages, thus they are also well suited for task parallelism. Specifically, K implements a no-pointer reference counting scheme, which ensures that all memory is reclaimed exactly when possible (although temporaries aren't efficiently handled in the current implementation). Running 4 functions simultaneously is as easy as writing .:': (f1;f2;f3;f4)

I am in no way affiliated with the people who make K. But my programming style has changed significantly since I was introduced to it — contributing to shorter programs, which are usually faster, and less prone to bugs, although I must admit less welcoming to a new maintainer.

#9 Yossi Kreinin on 03.26.08 at 12:26 pm

I ought to take a closer look at APL just to get a feeling how they do this… I can't use it as a basis for a DSL because I write DSLs for other people to use, my target audience normally being purposefully unsophisticated about their syntactic preference and just wishing for their usual style to compile (in other words, "not language geeks").

It's very easy to intrigue me with an opportunity to make code more compact though. In particular, I used to be intrigued by Forth, and to an extent, I still am. Examples of compact Forth programs: http://www.jwdt.com/~paysan/screenful.html

I currently believe that Forth cheats :) That is, the terseness comes at a price I wouldn't like to pay; a typical longer program in a more popular style doing the same thing would have features making it easier to deal with. For example, the Forth object system in the above-mentioned page isn't a substitute for a typical "object system" outside of Forth.

#10 kragen on 04.24.08 at 4:36 am

Simon Peyton Jones has put his book on implementing pure functional programming languages online. I found it very interesting to read, but I don't remember the title.

I think that if you want to get noticeable parallel speedups on multicore processors, you may have to give up on update-in-place of shared data structures. One way to do this is to take the Tcl approach of never having shared objects, ever, anywhere; this is called "linearity" or "linear logic". It would be a shame if you had to give up on update-in-place, because it pretty much rules out any kind of combinator-graph reduction machine. Somebody (Bawden? I forget) did a dissertation in the 80s on distributed linear graph reduction, though.

I think your title is a little misleading. The level at which aliasing causes problems is exactly the level at which side effects are involved.

#11 Yossi Kreinin on 04.24.08 at 11:26 am

"never having shared objects, ever, anywhere"

No can do. I ain't at no datacenter where you can have a 5x slowdown throughout the language/system and it's nothing as long as it allows to reasonably utilize the gazillion of processors you have. I'm talking about 10-20 core chips for real time embedded stuff, for example, where a 5x slowdown means you now have 2-4 cores, which is totally dumb. So I need the speed of C minus 10% overhead, tops. Can't use message passing to copy large objects, for example.

#12 kragen on 04.25.08 at 5:27 pm

Well, the whole linear-logic thing always sounded pretty crazy to me too, but apparently Henry Baker and the other guys who did this got pretty good benchmarks using it back in the 80s, and e.g. Wouter van Oortmerssen is still enthusiastic about the idea today. Sometimes they used reference-counting to make language-level "copies" actually happen lazily.

But I don't know if you can get within 10% of C speed with linear languages. Certainly they haven't hit the mainstream in the 25 years since they were invented.

How many 6502s can you fit in the transistor budget of a Xeon?

#13 jamii on 06.08.08 at 4:00 am

You might find Haskells nested data parallelism to be more interesting for image processing and similar style tasks. Its very much a work in progress but they've already shown some impressive benchmarks.

The basic idea is that functions are written as a combination of stream-like operators and array comprehensions on arrays of any type. The compiler then transforms this into operations on flat arrays with primitive types which can then be split across multiple processors. Stream fusion (http://lambda-the-ultimate.org/node/2192) is used to remove intermediate arrays.

One of the accepted summer of code projects for this year is a physics engine built using ndp which should be an interesting demonstration of the method.

#14 Yossi Kreinin on 06.09.08 at 4:09 am

Interesting. I somehow have the feeling that the SIMD instruction selection problem, the one that inflicts intrinsics upon us on all SIMD targets, isn't going to be attacked here. Distributing to multiple cores, and perhaps some forms of tiling for cache optimization, are more likely to work out - Intel are doing this with IPP in C.

#15 Alaric Snell-Pym on 09.14.11 at 4:36 am

You don't need to abandon sharing to use linear logic.

In Concurrent Clean, references may be linear (in which case the type system ensures they are the sole reference to whatever it refers to) or no.

If you mutate something you have a linear reference to, it's mutated in place and the same reference is returned.

If you mutate something you have a nonlinear reference to, it's copied afresh and the copy mutated, and a new (linear, as you just made it and haven't shared it) reference is returned, so subsequent mutations will be "fast".

(More or less. Some artistic licence was applied in that description.)

#16 Yossi Kreinin on 09.27.11 at 11:15 pm

Why is it called "linear", and what happens when two distinct copies of the same thing are created and modified as you described?

Leave a Comment