Side effects or not, aliasing kills you
My programming personality is very depressed right now. For a while, I've been away from my usual long-term stuff and back in
the trenches. I'm doing simple things, but there are lots of them, so it feels like I'm not moving anywhere even though I go as
fast as I can. It's like driving down a highway with monotonous scenery.
Well, this has completely drained my brain. So I won't tell you anything in this blog – instead, I'll ask you a question.
This way, it's OK for me to be wrong, which is even more likely than usual in my current state. Clever trick, isn't it?
My question basically is, how is not having side effects in your programming language going to help you write correct,
and optimal, and parallel programs? I've already mentioned it, but this time, I'll elaborate on the point that
I fail to get.
The problem with parallelism is scheduling. We want scheduling to be:
- Correct: when A depends on B, B should happen before A.
- Optimal: A, B and everything else should complete as soon as possible.
So, in order to correctly schedule basic blocks of code called A and B, I need to know if they depend on each other. If I
have side effects, this is going to be hard. A modifies an object called X, and B modifies Y. Can X and Y be the same thing? To
figure it out, you need whole-program analysis, and if your language is Turing-complete, WPA won't necessarily succeed,
either.
This is the aliasing problem. The aliasing problem rules. It's the problem that forces you to use restrict to
optimize numeric code. It's the problem that forces you to use static typing if you want to optimize anything at all. And it's
the problem that makes all safe languages completely and wildly unsafe when you have multiple threads. The rule of thumb is: if
you have a good idea for compile-time or run-time anything, it won't work because of the aliasing problem.
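To make the restrict story concrete, here's the classic memcpy-vs-memmove hazard as a toy Python sketch (function and variable names are mine): an optimizer may freely reorder or vectorize the copy loop only if it can prove the two arguments never overlap.

```python
def shift_right(xs, out):
    # Intends out[i+1] = xs[i] for all i -- a plain copy that a compiler
    # could reorder or vectorize, if only it knew xs and out never overlap.
    for i in range(len(xs) - 1):
        out[i + 1] = xs[i]

a = [1, 2, 3, 4]
b = [0, 0, 0, 0]
shift_right(a, b)   # no aliasing: b becomes [0, 1, 2, 3]

c = [1, 2, 3, 4]
shift_right(c, c)   # c aliases itself: earlier writes clobber later reads,
                    # so c becomes [1, 1, 1, 1] instead of [1, 1, 2, 3]
```

C's restrict is exactly a promise from the programmer that the second situation cannot happen, because the compiler usually cannot prove it on its own.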
Now, suppose we got rid of those pesky side effects. A doesn't modify X; instead, it creates a new object, Z. Similarly, B
takes Y and makes W. Now we can prove that A doesn't depend on B. As soon as their inputs are ready, they can run. This takes
care of the correctness problem, which was listed as the scheduling problem #1 at the beginning.
However, this didn't solve the aliasing problem – X and Y can still be the same. I suspect that the aliasing problem is going
to raise its ugly head and kick our pretty butt. Specifically, while semantically we made Z from X, we'd really like to wipe out
X and reuse its memory for hosting Z. Because that would be more efficient than copying X and modifying some of it and keeping
two objects around. Now, we can't do this optimization if X and Y can be the same thing and we know nothing about the execution
order of A and B, can we? If A replaced X with Z in place, B would read the wrong Y.
And if we copy things, we aren't going to get an optimal scheduling, because A and B now run slower – problem #2, right?
Seriously, what am I missing? I realize that there are parallel execution environments where you have to copy things anyway
because you have a zillion processors and a zillion processors can't have shared memory. I also realize that you can have just a
few objects shared between tasks that run in parallel, so the overhead of copying them isn't very important. I realize all kinds of
things making the aliasing problem a non-problem for side-effect-free parallelization frameworks or languages or whatever.
What I fail to see is, if you run in a single-box, multi-core system, and you do have shared memory, and you
do have lots of shared data, how does removing side effects help you? With side effects, you can get an efficient
program, if you can handle the debugging. Without side effects, you get a correct program, and then you must somehow optimize
object allocation. And I just don't see how you'd go about that without compromising correctness.
Maybe I'm ignorant. Maybe I should have read about this somewhere. But where do I find a levelheaded discussion of, say,
common implementation techniques of pure functional language compilers? I just keep bumping into "sufficiently smart compiler"
arguments, and then there are people who don't care much about performance (and shouldn't because of their application
domain).
Compare this to imperative languages. Anybody can write a pretty decent C compiler. If you don't optimize at all, you get
what, 4x slowdown compared to the state of the art
optimizer? And on a scalar RISC machine, I bet you can bridge 50%-70% of that gap by your cheapest shot at register allocation.
A no-brainer, you need about 30% of a comp sci degree and you're all set. Now, how am I supposed to optimize allocation if there
are no side effects? I mean, your spec or your family of specs should come with implementation techniques. "Useful" is good by
itself, but it must also be implementable.
By the way, I'd love to find out that I'm wrong on this one. I've done lots of parallelism-related debugging, and it
sucks. And maybe I am wrong. Maybe there just isn't much talk about the implementation details of these things. I do think my
reasoning makes sense though.
What language are you talking about? C?
Using immutable (side-effect free) aliased data in a language with
automatic garbage collection is quite straightforward. As long as any
thread holds a reference (pointer) to the data, it stays allocated. When
no threads hold references to the data, the memory can be reused.
Also, as far as making modified "copies", it's not necessary to
actually copy. You can also use a "wrapper" object that dynamically
applies the change(s) whenever the underlying data is accessed.
All of the "pure functional languages" that I know of use immutable
(side-effect free) data with automatic garbage collection. It's also
usually pretty easy to create those wrapper objects.
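As a minimal sketch of the wrapper idea (the class and names are mine, not any particular language's machinery): instead of copying, record the changes and consult them on access.

```python
class Modified:
    """An immutable 'modified copy' that shares the original object
    and applies the recorded changes only when the data is accessed."""
    def __init__(self, base, changes):
        self.base = base        # the original; never mutated
        self.changes = changes  # {key: new_value} overlay

    def __getitem__(self, key):
        # The change overlay wins; everything else falls through to base.
        if key in self.changes:
            return self.changes[key]
        return self.base[key]

original = {"x": 1, "y": 2}
view = Modified(original, {"y": 99})
# view["y"] is 99, view["x"] falls through to 1, and original is untouched
```

The cost is exactly the extra indirection discussed below: every access now checks the overlay before reaching the shared data.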
If automatic garbage collection is unacceptable for your application,
you might want to look at the Ada language. It specifically restricts
how aliasing can be used.
In the end, you can't have everything. Something will be compromised,
whether it's efficiency, debugability, simplicity, reliability, or
whatever. If there was a way that we could somehow "have it all", we
would simply do things that way all of the time.
I was talking about a hypothetical language that was supposed to be
efficient and guarantee correct parallelization, something I'd need to
write a compiler for.
Suppose you can have automatic garbage collection. Now you enter A,
which makes Z from X. In order to create Z where X used to be, you have
to know whether X is still used *right now*. Since you can't look at a
ref count, which would break in the presence of cyclic references, and
garbage collection only runs periodically, you are likely to lack the
"confidence" needed to dispose of X when you need to do that.
As to wrapper objects, that seems to add a level of memory
indirection, which can be nasty performance-wise.
Intuitively, I think about converting programs with side effects into
programs without them. Imperative programs frequently create large
objects and then modify them; those modifications would look like
copying most of the object if translated straightforwardly to a
functional style. It could be that "native" functional programs look
very different and so the real optimization problems are also
different.
Regarding "have it all": I couldn't agree more :)
Well, functional programs really do look rather different from
typical imperative programs. There are special techniques for getting
performance from immutable datastructures where you can be certain that
once computed, the data will remain the same forever. These mostly have
to do with sharing parts of datastructures which are really the same.
For instance, you might have a set which is represented by some balanced
binary tree structure. As opposed to making a modified wholesale copy of
that tree, the function which, say, adds an element to that set will
only generate log(n) new tree nodes — the path from the new node up to
the root. All the other subtrees will simply be shared. In the case of
mutable datastructures, this would be a bad policy, since modifications
to the original set would then be strangely and perhaps incoherently
reflected in the new one. It's the immutability which allows you to do
it.
In general, in pure value-oriented languages, when you make a change
to just some small part of a large datastructure, you generally only
need to copy a small part of it (ideally O(log(n)) assuming that you've
selected the right datastructure for your access pattern), and
everything else just gets referred to.
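The path-copying scheme the comment describes fits in a few lines of Python (an unbalanced tree for brevity; a real implementation would keep it balanced to guarantee the O(log n) bound):

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Return a new tree that contains key. Only the nodes on the search
    path are rebuilt; every untouched subtree is shared with the old tree."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    if key > root.key:
        return Node(root.key, root.left, insert(root.right, key))
    return root  # key already present; share the whole tree

t1 = insert(insert(insert(None, 5), 2), 8)
t2 = insert(t1, 1)  # rebuilds the path 5 -> 2; the subtree at 8 is shared
```

Immutability is what makes the sharing safe: since t1 can never change, t2 can point into it freely, and both versions remain usable from any thread.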
There are also techniques for ensuring that datastructures are not
constructed at all, by fusing the code which produces the datastructure
with the code that consumes it, generally referred to as deforestation.
If you have one producer and multiple consumers, there are of course
limits on what you can do. Uniqueness typing is a way to get the
compiler to enforce that a value is only ever used once, so that after
that happens, the space used for it can be reclaimed and used for
something else. Even without that, though, the techniques for dealing
with many short-lived objects are getting quite good, to the point that
we hardly worry about them in most
cases. You don't really think in terms of one thing getting mutated over
and over as much as just what the value you're interested in is, and
what other values might be useful in order to construct it. It's only
once you have an explicit problem with space, and profiling turns up
that you need to be careful with the lifetime of values that you start
to worry about retention and making sure that things can be reclaimed
earlier.
You don't need to know whether X is still used *right now*. All you
need to know is that the memory being used for your new object is no
longer in use for anything else. How long ago it quit being used for
anything else is unimportant. We don't need to "dispose of X" the
instant that it is no longer in use – disposal can wait.
Most automatic garbage collection systems keep a pool of unused
memory. New objects are created from that pool. The garbage collection
system regularly sweeps through the data and finds unused objects and
adds the associated memory to the pool. An unused object might sit
around for a long time before the garbage collector realizes that the
object is no longer being used. Then the reclaimed memory might sit
around in the pool a long time before it is actually reused. No big
deal.
It sounds to me like you're worrying about inefficiencies that are
inconsequential for most programming tasks. "A level of memory
indirection" is measured in nanoseconds these days, and hardly
qualifies as "nasty performance-wise" in the overwhelming majority of
software domains.
For the few domains where this kind of nanosecond nitpicking is
necessary, functional programming would never do. Today's multi-threaded
language of choice for those domains is probably Ada.
In addition to Ada, you might want to look at Occam for ideas.
Interestingly enough, when nanoseconds *do* matter, a very nice
tactic is to use a functional language or some other high level language
to construct assembly or other low-level code. Such programs can do lots
of work to try different implementations of the task you're interested
in, which someone coding directly in assembly would never be able to
look through, and since you have domain specific knowledge of the
problem, you can make use of algebraic properties which ordinary
compilers don't know about.
A good example of this is FFTW (the Fastest Fourier Transform in the
West), whose code generator is an OCaml program that produces C code to compute
discrete Fourier transforms.
A more general approach to this sort of thing is being worked on in
the Coconut project at McMaster University. (You can watch a Google
video about that here: http://uk.youtube.com/watch?v=yHd0u6zuWdw)
They're designing a very high level Haskell DSL for automatically
constructing efficient SIMD and SMP code for signal processing tasks. On
some tasks they're doing 4x better than the best hand-optimised programs
written in C with inline assembly, because their tool can search a broad
space of possible implementations and make use of algebraic properties
of the problem at hand while doing it.
You can do away with pointers. APL, K and J are functional vector
languages that do not allow pointers, thus making the reference count
authoritative with respect to the safety of changing memory (just one
reference means it's OK to change). I think Tcl does the same, but I'm
not sure.
Of course, this only holds at the language level — if you
manufacture pointers yourself, e.g. by indirectly using array coordinates of a
global variable, you're back in the same place. However, the languages
being (mostly) functional, 99% of the code written does no such
thing, and is therefore parallelizable.
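CPython happens to demonstrate the refcount rule directly (a CPython-specific sketch; note that sys.getrefcount counts its own argument as one extra reference):

```python
import sys

x = [1, 2, 3]
# One binding plus getrefcount's own argument: a count of 2 means x has
# no other owners, so its storage could safely be reused in place.
print(sys.getrefcount(x))   # 2

y = x                       # an alias appears...
print(sys.getrefcount(x))   # 3: in-place reuse is no longer safe
```

This is exactly the "refcount of one means OK to modify" rule: without user-visible pointers there are no hidden aliases the count could miss.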
If you're not familiar with J, K or APL, you should be — you'll probably
like them. They do away with a lot of the unneeded cruft we've come to
accept as a necessary evil when programming. There's a steep learning
curve, and the syntax is not newbie-friendly, but the reward is programs
that are 10-100 times shorter than their C equivalents (usually closer to
the 100 end of the scale), while — at least in K's case — usually comparable in
speed.
APL was invented as an "algorithm notation language" for paper back
in the fifties, and then implemented in the 60's when someone noticed
that it was well defined enough to execute. J and K are two descendants,
the former focusing on completeness and mathematical elegance (IMHO,
a failure), and the latter focusing on speed, terseness, and a bias
towards financial applications.
To Doug: I am worried about the inefficiency of the delayed
reclaiming and all the "right now" business because I'm anal-retentive
about performance – not in general, but in the specific context of real
time embedded machine vision apps, which is the context where this entry
is supposed to belong. I realize that your typical functional
programming application domain is different. Thanks for the pointer on
aliasing restriction in Ada – I'll have a look.
To cgibbard: Thanks for the link; I love code generation. I do want
to somehow eliminate side-effect/parallelism problems at the target
machine though; it's a separate story.
To orib: I think that vector languages like APL are the way to go if
you do data parallelism, but I'm not sure about task parallelism, which
is what I have in mind. I'd like to look at K nonetheless; the 100x part
sounds too good to be true...
I know it sounds too good to be true. And to get there, it does take
dropping all the useless things we studied in the last 20 years or so of
programming.
qn:{[n],/{:[n=#*x;,*x;,/_f'f x,:/:(!n)_dvl,/x]}'(f:0 -1 1+/:)@!n}
bd:{[p]`0:".Q"p=:!#p}
Yep, I know it looks unreadable, but it is not obfuscated, just very
terse (did I mention the learning curve is steep?)
The first line "qn: ..." defines a function of one variable, n, that
solves the n-queen problem. The second line "bd:..." defines a function
that prints a solution produced by the first line. http://nsl.com/papers/qn.htm for more.
A sudoku solver:
p,:3/:_(p:9:!81)%3
s:{*(,x)(,/{@[x;y;:;]'&21=x[&|/p[;y]=p]?!10}')/&~x}
That's actually idiomatic, NOT obfuscated. You'll have to trust me on
this.
A function to (efficiently!) compute maximum running sum inside a
list of numbers is:
|/0(0|+)\
(That is: find the maximal sum of the elements between indices i and j over all
possible pairs with j>i. This would be the sum of the whole vector if all numbers
are nonnegative, but not if there are negative numbers.)
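For readers who don't speak K: the one-liner is Kadane's algorithm, a max over a running scan that clamps at zero. A Python rendering of the same idea:

```python
def max_running_sum(xs):
    """Maximum sum of any contiguous run (the empty run counts as 0),
    in one pass: scan with acc = max(0, acc + x) and track the peak."""
    best = acc = 0
    for x in xs:
        acc = max(0, acc + x)   # drop the prefix once it turns negative
        best = max(best, acc)
    return best

max_running_sum([1, -3, 4, -1, 2, -5, 3])   # 5, from the run 4 + -1 + 2
```

Like the K version, it returns 0 for an all-negative input, since the empty run is allowed.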
A function to compute an average of a vector is {(+/x)%#x} read "sum
over x divided by count of x". It is so straightforward that you see it
inline even though it's in the standard library (as "avg" — the name
"average" is as long as the function definition and lacks the semantics
that the function itself carries)
A scalar product function is written {+/x*y} or just +/* and to turn
it into a matrix product, {+/*:/:[x;+y]}
A small database with a concept of order (more than relational),
aggregation, etc implemented in 14 short lines can be found in http://nsl.com/k/t.k
K takes the APL approach to a radical place: Use ~60 basic functions,
most of which aren't trivial, but which are mostly orthogonal and well
chosen — apply them to very basic data types that include various atoms
and vectors — and you find out that you don't need as much abstraction
as all languages strive to present.
Now, having an APL heritage, those 60 functions are mapped onto ascii
symbols, with each symbol getting one function as an unary op and
another as a binary op (e.g., binary * is multiply, unary * is "first" —
rhymes with pointer use in C). But there's a syntactic sugar layer on K
(called "q") which makes everything into English, making "#&~~':"
into the more readable "count where not eachprev match"
Being so terse, it's possible for the eyes to learn to pick patterns
(like "average" above) instantaneously, and scan the whole database
implementation on the screen at once.
Now, the other side is that APL and K are actually functional
languages, thus they are also well suited for task parallelism.
Specifically, K implements a no-pointer reference counting scheme, which
ensures that all memory is reclaimed exactly when possible (although
temporaries aren't efficiently handled in the current implementation).
Running 4 functions simultaneously is as easy as writing .:':
(f1;f2;f3;f4)
I am in no way affiliated with the people who make K. But my
programming style has changed significantly since I was introduced to it
— contributing to shorter programs, which are usually faster, and less
prone to bugs, although I must admit less welcoming to a new
maintainer.
I ought to take a closer look at APL just to get a feeling how they
do this... I can't use it as a basis for a DSL because I write DSLs for
other people to use, my target audience normally being purposefully
unsophisticated about their syntactic preference and just wishing for
their usual style to compile (in other words, "not language geeks").
It's very easy to intrigue me with an opportunity to make code more
compact though. In particular, I used to be intrigued by Forth, and to
an extent, I still am. Examples of compact Forth programs: http://www.jwdt.com/~paysan/screenful.html
I currently believe that Forth cheats :) That is, the terseness comes
at a price I wouldn't like to pay; a typical longer program in a more
popular style doing the same thing would have features making it easier
to deal with. For example, the Forth object system in the
above-mentioned page isn't a substitute for a typical "object system"
outside of Forth.
Simon Peyton Jones has put his book on implementing pure functional
programming languages online. I found it very interesting to read, but I
don't remember the title.
I think that if you want to get noticeable parallel speedups on
multicore processors, you may have to give up on update-in-place of
shared data structures. One way to do this is to take the Tcl approach
of never having shared objects, ever, anywhere; this is called
"linearity" or "linear logic". It would be a shame if you had to give up
on update-in-place, because it pretty much rules out any kind of
combinator-graph reduction machine. Somebody (Bawden? I forget) did a
dissertation in the 80s on distributed linear graph reduction,
though.
I think your title is a little misleading. The level at which
aliasing causes problems is exactly the level at which side effects are
involved.
"never having shared objects, ever, anywhere"
No can do. I ain't at no datacenter where you can have a 5x slowdown
throughout the language/system and it's nothing as long as it lets you
reasonably utilize the gazillion processors you have. I'm talking
about 10-20 core chips for real time embedded stuff, for example, where
a 5x slowdown means you now effectively have 2-4 cores, which is totally dumb. So I
need the speed of C minus 10% overhead, tops. I can't use message passing
that copies large objects, for example.
Well, the whole linear-logic thing always sounded pretty crazy to me
too, but apparently Henry Baker and the other guys who did this got
pretty good benchmarks using it back in the 80s, and e.g. Wouter van
Oortmerssen is still enthusiastic about the idea today. Sometimes they
used reference-counting to make language-level "copies" actually happen
lazily.
But I don't know if you can get within 10% of C speed with linear
languages. Certainly they haven't hit the mainstream in the 25 years
since they were invented.
How many 6502s can you fit in the transistor budget of a Xeon?
You might find Haskell's nested data parallelism to be more
interesting for image processing and similar style tasks. It's very much
a work in progress, but they've already shown some impressive
benchmarks.
The basic idea is that functions are written as a combination of
stream-like operators and array comprehensions on arrays of any type.
The compiler then transforms this into operations on flat arrays with
primitive types which can then be split across multiple processors.
Stream fusion (http://lambda-the-ultimate.org/node/2192) is used to
remove intermediate arrays.
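A loose analogy for what fusion buys you, in Python terms (generators stand in for fused streams; this illustrates the flavor only, not GHC's actual transformation):

```python
# Each stage is lazy; composing them never allocates an intermediate list.
def squares(xs):
    return (x * x for x in xs)

def evens(xs):
    return (x for x in xs if x % 2 == 0)

# One fused pass over range(10): no [0, 1, 4, ...] array materializes.
total = sum(evens(squares(range(10))))   # 0 + 4 + 16 + 36 + 64 = 120
```

The compiler-driven version goes further: it turns the whole pipeline into a single tight loop with no per-element closure overhead at all.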
One of the accepted summer of code projects for this year is a
physics engine built using ndp which should be an interesting
demonstration of the method.
Interesting. I somehow have the feeling that the SIMD instruction
selection problem, the one that inflicts intrinsics upon us on all SIMD
targets, isn't going to be attacked here. Distributing to multiple
cores, and perhaps some forms of tiling for cache optimization, are more
likely to work out – Intel are doing this with IPP in C.
You don't need to abandon sharing to use linear logic.
In Concurrent Clean, references may be linear (in which case the type
system ensures they are the sole reference to whatever they refer to) or
not.
If you mutate something you have a linear reference to, it's mutated
in place and the same reference is returned.
If you mutate something you have a nonlinear reference to, it's
copied afresh and the copy mutated, and a new (linear, as you just made
it and haven't shared it) reference is returned, so subsequent mutations
will be "fast".
(More or less. Some artistic licence was applied in that
description.)
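Roughly, that policy can be sketched like this (an explicit uniqueness flag stands in for Clean's uniqueness typing; all names are mine):

```python
class Ref:
    """A reference that knows whether it is the sole owner of its payload."""
    def __init__(self, value, unique=True):
        self.value = value
        self.unique = unique

def share(ref):
    # Handing out a second reference forfeits uniqueness for both.
    ref.unique = False
    return Ref(ref.value, unique=False)

def set_item(ref, i, v):
    """Mutate in place when the reference is unique; otherwise copy first.
    Either way the returned reference is unique, so later updates are fast."""
    if not ref.unique:
        ref = Ref(list(ref.value), unique=True)  # fresh private copy
    ref.value[i] = v
    return ref
```

With two shared references, the first update pays for one copy, subsequent updates through the returned reference happen in place, and the other holder still sees the original value.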
Why is it called "linear", and what happens when two distinct copies
of the same thing are created and modified as you described?