Coroutines in one page of C

August 19th, 2013

A coroutine is a function that you can jump back into after returning from it - and it remembers where it was in the code, and all the variables. This is very useful at times.

One use is generating a sequence of values. Here's how you can generate all the x,y pairs in a 2D range in Python:

def iterate(max_x, max_y):
  for x in range(max_x):
    for y in range(max_y):
      yield x,y

for x,y in iterate(2,2):
  print x,y

This prints:

The yield keyword is like return, except you can call the function again and it proceeds from the point where it left. Without the syntactic sugar, the loop using iterate() would look like this:

coroutine = iterate(2,2)
while True:
  try:
    x,y = coroutine.next()
    print x,y
  except StopIteration:
    break

So iterate() returns a coroutine object, and you can call coroutine.next() to get the next value produced by yield. Eventually, iterate() runs out of values and returns, and then you get a StopIteration exception. That's Python's protocol for coroutines.

This can be nicer than straightforwardly iterating over the 2D range with 2 nested loops, because we're abstracting away the nested loops - or recursive tree/graph walks, or other hairy traversals. And it's nicer than writing iterator classes to abstract away traversals, because you don't need to manually compile the logic down to a state machine, as in "if(_x==_max_x) _y++; else _x++". The iteration is both abstracted away and can be seen plainly in the code implementing it.

Coroutines have other uses; a fashionable one is multiplexing many concurrent processes onto a small thread pool. As in, goroutines are coroutines, just, um, spelled differently.

Unfortunately, like many other languages, C doesn't have coroutines. What it does have though is setjmp/longjmp, which come very close to coroutine support - the missing step is just a sprinkle of inline assembly.

On some systems you actually get full coroutine support out of the box - as Wikipedia points out, you don't need any assembly if you have makecontext and friends. Many embedded systems don't have makecontext though, and do have setjmp/longjmp, so it's interesting how we can get by with just that.

The example code below isn't a full-fledged coroutine library - there are many libraries to choose from already. Rather, it shows what it takes to roll your own coroutines if you need to - which is just a page of code.

So here's how the iterator example can look in C with coroutines:

#include <stdio.h>
#include "coroutine.h"

typedef struct {
  coroutine* c;
  int max_x, max_y;
  int x, y;
} iter;

void iterate(void* p) {
  iter* it = (iter*)p;
  int x,y;
  for(x=0; x<it->max_x; x++) {
    for(y=0; y<it->max_y; y++) {
      it->x = x;
      it->y = y;
      yield(it->c);
    }
  }
}

#define N 1024

int main() {
  coroutine c;
  int stack[N];
  iter it = {&c, 3, 2};
  start(&c, &iterate, &it, stack+N);
  while(next(&c)) {
    printf("x=%d y=%d\n", it.x, it.y);
  }
}

This is somewhat uglier than Python, with an iter struct used for specifying the 2D range (max_x, max_y) as well as for generating the values (x,y). We also get to allocate a stack - and, in my simplistic example, we must actually know that stacks grow downwards on our platform (hence we pass stack+N instead of stack). Worse, C doesn't really give us a way to estimate how much stack space we'll need, or to grow the stack on demand.

But the flow is similar - start initializes a coroutine, next is used to get values, returning false when it runs out of values, and yield is called to jump back to the consumer.

We'll now walk through an implementation of this coroutine interface. The interface itself looks like this:

#include <setjmp.h>

typedef struct {
  jmp_buf callee_context;
  jmp_buf caller_context;
} coroutine;

typedef void (*func)(void*);

void start(coroutine* c, func f, void* arg, void* sp);
void yield(coroutine* c);
int next(coroutine* c);

A coroutine is thus two contexts: the callee's (producer's), and the caller's (consumer's). The tricky and non-portable part is initializing these contexts (jmp_buf's). The cleaner and portable part is jumping back and forth between them, so let's see that first:

#include "coroutine.h"

enum { WORKING=1, DONE };

void yield(coroutine* c) {
  if(!setjmp(c->callee_context)) {
    longjmp(c->caller_context, WORKING);
  }
}

int next(coroutine* c) {
  int ret = setjmp(c->caller_context);
  if(!ret) {
    longjmp(c->callee_context, 1);
  }
  else {
    return ret == WORKING;
  }
}

So yield and next are each just a pair of setjmp/longjmp calls. The trick with setjmp is, it returns 0 right after you call it - having saved the context (machine registers, basically) to your jmp_buf. You can later return to this context (restore register values) with longjmp.

Where exactly does longjmp bring you? To the point in code where setjmp has just returned. How do you know that setjmp has just returned because longjmp brought you there - as opposed to returning directly after you called setjmp? Because setjmp returns the non-zero value that was passed to longjmp, rather than 0 as it does when returning directly. So if(!setjmp) asks, "did I just call you - or was it a long time ago and we're back from longjmp?"

As you can see, there's almost no difference between yield and next - both say, "save our current context and then go to the other side's context". The only thing setting next apart is, it reports whether we're DONE or WORKING.

But, it looks like we're always WORKING, doesn't it? ret is what's passed to longjmp and yield always passes WORKING. Which brings us to start - the one passing DONE, because, well, the code of start also contains the part where we're finished.

typedef struct {
  coroutine* c;
  func f;
  void* arg;
  void* old_sp;
  void* old_fp;
} start_params;

#define get_sp(p) \
  asm volatile("movq %%rsp, %0" : "=r"(p))
#define get_fp(p) \
  asm volatile("movq %%rbp, %0" : "=r"(p))
#define set_sp(p) \
  asm volatile("movq %0, %%rsp" : : "r"(p))
#define set_fp(p) \
  asm volatile("movq %0, %%rbp" : : "r"(p))

enum { FRAME_SZ=5 }; //fairly arbitrary

void start(coroutine* c, func f, void* arg, void* sp)
{
  start_params* p = ((start_params*)sp)-1;

  //save params before stack switching
  p->c = c;
  p->f = f;
  p->arg = arg;
  get_sp(p->old_sp);
  get_fp(p->old_fp);

  set_sp(p - FRAME_SZ);
  set_fp(p); //effectively clobbers p
  //(and all other locals)
  get_fp(p); //...so we read p back from $fp

  //and now we read our params from p
  if(!setjmp(p->c->callee_context)) {
    set_sp(p->old_sp);
    set_fp(p->old_fp);
    return;
  }
  (*p->f)(p->arg);
  longjmp(p->c->caller_context, DONE);
}

This is longer, uglier, non-portable and perhaps not strictly correct (please tell me if you spot bugs - I have close to zero experience with x86-64, it's just what I test on at home.)

So why so ugly? Because we need a separate stack for the coroutine (the producer), and C doesn't have a way to give it a stack. And without a separate stack, our setjmp will save the registers alright - and then we go back to the consumer, and write happily to places on the stack.

By the time we longjmp back to the producer, all the local variables it kept on the stack (as opposed to in registers) may be long overwritten. That's why out of the box, setjmp/longjmp work fine as a try-catch/throw - but not as next/yield.

If, however, we:

Allocate space for a separate stack
Make the stack pointer point into that stack before setjmp
Restore the stack pointer after setjmp
Return to the caller (consumer)

…then we'll have created a jmp_buf with a stack pointer pointing to our separate space - jmp_buf remembers the stack pointer among the other registers. And then we can jump as we please back and forth between the producer and the consumer - they no longer overwrite each other's on-stack variables.

The actual stack switching is the ugly part of start. Once we accomplish that, and the producer calls next for the first time, bringing us past the if(!setjmp), all that is left is to call the entry point function and report that we're DONE:

  (*p->f)(p->arg);
  longjmp(p->c->caller_context, DONE);

The longjmp brings us back to a context saved by some call to next, and then next sees the DONE returned by setjmp and returns false to the consumer.

And now to the stack switching. Basically, all we need is some inline assembly to change $sp - the stack pointer register, something most machines have. There are two complications though:

Apart from $sp, there's also the frame pointer (at least in debug builds) - a register pointing into the stack and used to access local variables. So we have to switch that, too.
Our parameters, such as f and arg, may themselves be on the stack. So how do we access them after switching the stack?

Actually I think that the clean way around these is to write start in pure assembly - where you know where everything is and how to access it and no compiler puts it in different places under different build settings. But I don't like assembly, so instead I did this:

Use space right below the sp address (the stack grows downwards) to store our parameters (in a start_params structure). The beginning of this structure is kept in p. Together with the parameters, we save the stack pointer and frame pointer values:
```
start_params* p = ((start_params*)sp)-1;
p->c = c;
p->f = f;
p->arg = arg;
get_sp(p->old_sp);
get_fp(p->old_fp);
```
Set the frame pointer to p, and the stack pointer to p - FRAME_SZ (I picked a random constant for FRAME_SZ that I thought was large enough for whatever the compiler may want to keep on the stack - in bytes, I'm reserving 5*sizeof(start_params)). The idea is, the frame pointer is the highest address of our stack frame, and the stack pointer is the lowest address.
```
set_sp(p - FRAME_SZ);
set_fp(p);
```
Read the value of p back from the frame pointer register, because once we've changed the value of that register, we no longer trust any local variable value (because the compiler may have put it on the stack so it will try to access it through the frame pointer - which is no longer the same frame pointer that it has just used to store the variable, so it will get garbage.)
```
get_fp(p);
```
Be careful to access everything through p from on - and restore $sp and $fp before returning to the caller.

That's all; the get/set macros use the GNU inline assembly syntax to get/set the $rsp and $rbp registers of x86-64. The code worked with a bunch of compiler flags for me - but do tell if you see bugs in there.

So that's how coroutines work with setjmp/longjmp and a bit of inline assembly - hopefully, porting to another machine takes little more than rewriting the 4 asm macros.

Update: check out coroutines in standard C - no assembly whatsoever - by Tony Finch (the original article and a refined version not relying on C99 features). The trick is to allocate space for coroutine stacks on your own stack.

This approach may be more or less suitable for you than the kind of code appearing above depending on your situation (for instance, I wrote this stuff in a context where I had to allocate the stacks outside the caller's stack because of the OS stack protection features and my wish to yield in thread A and then call next in thread B).

1. Tanim IslamAug 20, 2013

In your python routine, should you not replace the "range" with "xrange," at least to preserve the spirit of using a pure python iterable?

2. John RegehrAug 20, 2013

This line looks like undefined behavior to me, but probably it is reliably compiled in practice:

start_params* p = ((start_params*)sp)-1;

3. Yossi KreininAug 20, 2013

@Tanim Islam: yeah, I just figured "range" is more mnemonic for someone not knowing Python and so reading it as pseudocode.

@John Regehr: very possibly it's undefined :) I'm not very good at C's semantics, really; I know I can carve out a struct out of a byte array in some cases but not others, and I was never completely sure which are which. With this type of code, I know I'll be able to maintain it as compilers/platforms change over the years but it's not because I'm trusting my formal understanding of C as much as because I trust that I can make sense of the assembly coming out of the compiler and how it's wrong and why. I wonder if start() in general is at all possible to defend in terms of C's semantics – for instance the fiddling with $sp and $fp, not to mention the FRAME_SZ constant.

BTW I discovered that on my platform for which I originally started writing this stuff, setjmp is broken in that it assumes 32b floating point registers on 32b processor variants; that's because the newlib version is from 1993 when there were just two variants of the machine – all registers 32b or all registers 64b, ints as well as floats. So the part which is supposedly correct according to C's semantics will need to be rewritten...

4. David JankeAug 20, 2013

also, xrange is not part of python 3... and range is now a Sequence

5. JonSep 6, 2013

Hi – this is cool. I'm going to play with it over the weekend. :)

I'm wondering if you've seen this other implementation of C coroutines? It probably has an uglier API, but I think it works by expanding macros into switch statements, which is possibly more portable, I'm not sure. It'd be interesting to see a comparison between the two.

http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

6. Yossi KreininSep 6, 2013

Indeed it's more portable; the downside is, you can only yield at the top level of your coroutine – that is, the coroutine can't itself call functions and have one of them yield control to the producer. That's because calling functions that yield requires multiple call stacks.

7. Michael MoserSep 24, 2013

Thanks for the article,

Now there is another trick for solving lack of getcontext: if jmp_buf is an array of values then one can infer the index that stores the stack pointer and change jmp_buf directly.

Also: a very useful feature for coroutines is to have a guard page at the end of the stack; otherwise one will never know if the program ran out of the limited stack space (if one wants to service lots of coroutines, then that's what it gets)

8. Yossi KreininSep 24, 2013

You could change the sp (and fp) directly by knowing which libc you're targeting; you'd need to deal with the fact that your local variables could have been rendered invalid by that when comping back from setjmp through longjmp though.

As to guard pages – if you have paging, you probably have getcontext... Non-Unix operating systems might have paging or they could have other memory protection features; I'm on an OS which doesn't use paging, for instance.

9. SagiApr 20, 2014

For c++ programmers Boost appears to include Coroutines too

http://www.boost.org/doc/libs/1_53_0/libs/coroutine
/doc/html/coroutine/intro.html

10. kevinOct 27, 2014

Hi Yosef, Nice writeup! So, I have tried this and it works on MACOSX 64bit and Windows 32bit very nicely, but Win64/x64 inline asm is not allowed in visual c++.

Any ideas how to write the set/get_sp/fp functions on _WIN64 target? I'm trying MASM but I think the function wind/unwind stuff is getting in the way, I'm no ASM guru.

I tried the TonyFinch/Picoro stuff from his github and it's awesomely portable and simple to read, but limited by stack space. so I am preferring your method which allows me to create my own stack space on the heap, and patch it in to each coroutine...

Wishing I could use ucontext but it seems deprecated by POSIX group and broken on MACOSX (my code which works on linux hangs on MACOSX). And finally, windows Fibers seems limited to stack space as well, I get around 2028 coroutines... (and it isn't portable). I even found a ucontext implementation for windows.

Some work I did on top of ucontext, for the curious: http://www.subatomicglue.com/secret/coro/readme.html

11. Yossi KreininOct 27, 2014

No inline asm at all? How come?

My solution for Windows was actually to cross-compile with mingw, so I don't know how to do it in VS.

12. kevinOct 27, 2014

Right, visual C++ doesn't support inline ASM for x64 targets, I guess they didn't want to have it interfering with their optimizer

https://www.google.com/?gws_rd=ssl#q=visual+c%2B%2B+no+inline+assembly+x64

13. kevinOct 30, 2014

This page descries the rationale for deprecating ucontext: http://pubs.opengroup.org/onlinepubs/009695399/functions/makecontext.html

They think that making people port to threads (i.e. pthreads) is easier/better than writing a new interface for ucontext which conforms to C standards. I guess they leave us with no POSIX way to implement coroutines, which seems shortsighted to me.

14. Yossi KreininOct 30, 2014

To be fair (?), C has the problem of handling stack overflow in coroutines that is kinda solvable for the OS, especially if you have plentiful pointer bits (by trapping stack overflow and mapping more memory pages on demand), but not solvable in userspace on today's systems. Solved in Go by having a calling convention handling stack overflow, AFAIK... So deprecating ucontext is like saying, "we take away a portable method to implement a construct but not really", which can seem reasonable; "if you're willing to risk stack overflow then you should also be willing to fiddle with assembly." On the other hand, C always lets you shoot yourself in the foot... so why not make this shot easier is not really clear...

15. Jesse LactinMar 30, 2017

Hi, Yossi. Cool write up. Coroutines have really opened my eyes to how flexible and clean iteration can be (C++ iterators are put to shame for their verbosity and implementation concerns). I think your one page API could be made to look more like the Python example if you're a glutton for punishment and want to mix variable argument lists and setjmp/longjmp/makecontext. But I'm at a loss of how to go about it, as my experience with setjmp/longjmp is limited, and I can't reason about the resulting control flow. Do you have any ideas as to where I could improve my setjmp mastery?

16. Yossi KreininMar 30, 2017

I think setjmp/longjmp sink in once you fiddle with them for a few hours, in the same way recursion does. C varargs are different though, they're always ugly at the callee code and never safe. I think you need C++ templates to make this look pretty (as in, any coroutine prototype works and you call it naturally.)

17. Jesse LactinMar 31, 2017

Thanks for getting back to me so quickly. I was thinking of an API that adds varargs to next() and yield() like this.

void yield(coroutine *c, void *arg0, ...);
int next(corotuine *, void *arg0, ...);

start() would be left unchanged. If arg0 is NULL, the code would terminate vararg handling. This API would require a simple change to the coroutine struct (a fixed-size buffer of void *s, and void ** pointing the next slot). It's not the prettiest solution, but it seems to work. I have no PC right now, so I can't run GDB to figure out why it crashes. I'm stuck with online C compilers ATM.

Have you heard of libffi? It allows you to call any function when you only know the API at runtime. It's the perfect solution if for calling an arbitrary coroutine. Here's an example Simon Tatham wrote: http://www.chiark.greenend.org.uk/doc/libffi-dev/html/Simple-Example.html

18. Yossi KreininApr 1, 2017

The problem with varargs is you need some sort of a convention for figuring out which arguments were actually passed (printf has one hairy one which isn't suitable for anything except printf.) With these complications, I'm not sure the code is gonna be prettier or safer with varargs than with a hand-coded params struct, and maybe the reverse. A C++11 variadic template instantiating a tuple struct from the parameters and then calling the entry point functions with these arguments might be as clean as what you're looking for.

19. Jesse LactinApr 3, 2017

My convention for this particular function is that the list of args is terminated with a NULL pointer. You were right about setjmp; I get it now. One last note, I got the API changes working, but it required volatile-qualifying x and y in iterate(). It printed garbage without it, and it seems that it's UB otherwise.

20. Jesse LactinApr 10, 2017

I got it working. Here's the gist of it: https://gist.github.com/anonymous/e9a4f29a2e62452a36ba7488d7422242

I like this API a little more because it doesn't require me to make a struct. For mixing varargs and setjmp, I think it's a pretty clean interface. Once I got my head around setjmp (next() longjmps back into start(), and calls the entry point function), it all seemed to fall into place for me.

21. Yossi KreininApr 18, 2017

I guess I prefer almost anything to varargs except when I'm ultimately forwarding arguments for printf, because of how unsafe and unreadable accessing varargs is, but to each his own :-) I don't think my code makes a good library anyway, it's more of an example one would mold to suit their use case, which I guess is what you did there.

22. Jesse LactinApr 21, 2017

I think your right. Varargs are ugly and error prone. My modification asserts when not enough arguments are missing. I haven't worked out all the corner cases yet. Maybe explicitly passing the expected number of arguments is safer. I saw a simple Lisp implementation use varargs for constructing cons-cells. I thought it was an alright solution instead of having multiple constructor functions. Would you be interested in seeing the code on Gist? I could format, comment, and post it.

23. Jesse LactinDec 9, 2017

Got it working on Windows, using the OS's fiber functions.
Here it is on gist. https://gist.github.com/anonymous/765f189ddd13665256b140ade76d0533

Coroutines in one page of C

Post a comment