The C++ Sucks Series: the quest for the entry point
Suppose you run on the x86 and you don't like its default FPU settings. For example, you want your programs to dump core when
they divide by zero or compute a NaN, having noticed that on average, these events aren't artifacts of clever numerical
algorithm design, but rather indications that somebody has been using uninitialized memory. It's not necessarily a good idea for
production code, but for debugging, you can tweak the x86 FPU thusly:
//this is a Linux header using GNU inline asm
#include <fpu_control.h>
void fpu_setup() {
    unsigned short cw;
    _FPU_GETCW(cw);
    cw &= ~_FPU_MASK_ZM; //Divide by zero
    cw &= ~_FPU_MASK_IM; //Invalid operation
    _FPU_SETCW(cw);
}
So you call this function somewhere during your program's initialization sequence, and sure enough, computations producing
NaN after the call to fpu_setup result in core dumps. Then one day someone computes a NaN before the call to fpu_setup,
and you get a core dump the first time you try to use the FPU after that point. Because that's how x86 maintains its "illegal
operation" flags and that's how it uses them to signal exceptions.
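Since it's the stale flags that make the first FPU instruction after setup trap, one option is to clear them before unmasking anything. Here is a minimal sketch of that idea, assuming glibc on Linux: `feclearexcept` is standard C99, `feenableexcept` is a GNU extension, and the name `fpu_setup_clean` is made up for this illustration (it is not the fpu_setup used in the rest of the article).

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // feenableexcept() is a GNU extension (g++ usually defines this already)
#endif
#include <cassert>
#include <fenv.h>

// Hypothetical variant of fpu_setup(): forget whatever exception flags were
// raised by earlier computations, then make divide-by-zero and invalid
// operations trap. Returns false if the platform refuses either request.
bool fpu_setup_clean() {
    if (feclearexcept(FE_ALL_EXCEPT) != 0)
        return false;
    return feenableexcept(FE_DIVBYZERO | FE_INVALID) != -1;
}
```

Note that this only gets rid of the misleading trap at the first FPU instruction after setup; it does nothing to locate a NaN computed earlier, which is the real problem here.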
The call stack you got is pretty worthless as you're after the context that computed the NaN, not the context that got the
exception because it happened to be the first one to use the FPU after the call to fpu_setup. So you move the call to fpu_setup
to the beginning of main(), but help it does not. That's because the offending computation happens before main, somewhere in the global object construction sequence. The order in which
global objects from different translation units are constructed is left unspecified by the C++ standard. So if you kindly excuse my phrasing – where should we shove the
call to fpu_setup?
If you have enough confidence in your understanding of the things going on (as opposed to entering hair-pulling mode), what
you start looking for is the REAL entry point. C++ is free to suck and execute parts of your program in "undefined" (random)
order, but a computer still executes instructions in a defined order, and whatever that order is, some instructions
ought to come first. Since main() isn't the real entry point in the sense that stuff happens before main, there ought to be
another function which does come first.
One thing that could work is to add a global object to each C++ translation unit, and have its constructor call fpu_setup();
one of those calls ought to come before the offending global constructor – assuming that global objects defined in the same
translation unit will be constructed one after another (AFAIK in practice they will, although in theory the implementation
could, for example, order the constructor calls by the object name, so they wouldn't). However, this can get gnarly for systems
with non-trivial build process and/or decomposition into shared libraries. Another problem is that compilers will "optimize
away" (throw away together with the side effects, actually) calls to constructors of global objects which aren't "used"
(mentioned by name). You can work around that by generating code "using" all the dummy objects from all the translation units
and calling that "using" code from, say, main. Good luck with that.
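For the record, the per-translation-unit trick could look roughly like the sketch below. It is squeezed into a single file for illustration; in a real build the `FpuInit` part would sit in a header included first by every .cpp file. The names `FpuInit`, `fpu_init_for_this_file` and `Offender` are hypothetical, and `fpu_setup` here is a stand-in that only counts its calls.

```cpp
#include <cassert>

// Stand-in for the article's fpu_setup(); here it only counts its calls.
static int g_setup_calls = 0;
static void fpu_setup() { ++g_setup_calls; }

// --- this part would live in a header included first by every .cpp file ---
struct FpuInit {
    FpuInit() {
        // Only the first FpuInit object constructed in the whole program
        // performs the setup; later ones just bump the counter.
        if (counter()++ == 0) fpu_setup();
    }
    // Inline function with a local static: one counter shared by all
    // translation units, constant-initialized to zero.
    static int& counter() { static int n = 0; return n; }
};
static FpuInit fpu_init_for_this_file;  // one copy per translation unit

// --- the rest of the translation unit ---
struct Offender {
    bool setup_ran_first;
    Offender() : setup_ran_first(g_setup_calls > 0) {}
};
Offender g_offender;  // defined after fpu_init_for_this_file, so constructed after it
```

The sketch leans on objects within one translation unit being constructed in definition order, and it still can't help with a library you only have object code for.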
The way I find much easier is to not try to solve this "portably" by working against the semantics prescribed by the C++
standard, but instead rely on the actual implementation, which usually has a defined entry point, and a bunch of functions known
to be called by the entry point before main. For example, the GNU libc uses a function called __libc_start_main, which is
eventually called by the code at _start (the "true" entry point containing the first executed instruction, AFAIK; I suck at
GNU/Linux and only know what was enough to get by until now.) In general, running `objdump -T <program> | grep start`
(which looks for symbols from shared libraries – "nm <program>" will miss those) is likely to turn up some interesting
function. In these situations, some people prefer to find out from the documentation, others prefer to crawl under a table and
die of depression; the grepping individuals of my sort are somewhere in between.
Now, instead of building (correctly configure-ing and make-ing) our own version of libc with __libc_start_main calling the
dreaded fpu_setup, we can use $LD_PRELOAD – an env var telling the loader to load our library first. If we trick the loader into
loading a shared library containing the symbol __libc_start_main, it will override libc's function with the same name. (I'm not
very good at dynamic loading, but the sad fact is that it's totally broken, under both Windows and Unix, in the simple sense
that where a static linker would give you a function redefinition error, the dynamic loader will pick a random function of the
two sharing a name, or it will call one of them from some contexts and the other one from other contexts, etc. But if you ever
played with dynamic loading, you already know that, so enough with that.)
Here's a __libc_start_main function calling fpu_setup and then the actual libc's __libc_start_main:
#include <dlfcn.h>
// fpu_setup() from the beginning of the article goes in the same file
void fpu_setup(void);
// the signature of libc's __libc_start_main
typedef int (*fcn)(int (*main) (int, char * *, char * *),
                   int argc,
                   char * * ubp_av,
                   void (*init) (void),
                   void (*fini) (void),
                   void (*rtld_fini) (void),
                   void * stack_end);
int __libc_start_main(int (*main) (int, char * *, char * *),
                      int argc,
                      char * * ubp_av,
                      void (*init) (void),
                      void (*fini) (void),
                      void (*rtld_fini) (void),
                      void * stack_end)
{
    fpu_setup();
    void* handle = dlopen("/lib/libc.so.6", RTLD_LAZY | RTLD_GLOBAL);
    fcn start = (fcn)dlsym(handle, "__libc_start_main");
    // hand control to the real thing and propagate its result
    return (*start)(main, argc, ubp_av, init, fini, rtld_fini, stack_end);
}
Pretty, isn't it? Most of the characters are spent on spelling the arguments of this monstrosity – not really interesting
since we simply propagate whatever args turned up by grepping/googling for "__libc_start_main" to the "real" libc's
__libc_start_main. dlopen and dlsym give us access to that real __libc_start_main, and /lib/libc.so.6 is where my Linux box
keeps its libc (I found out using `ldd <program> | grep libc`).
If you save this to a fplib.c file, you can use it thusly:
gcc -shared -fPIC -o fplib.so fplib.c -ldl
env LD_PRELOAD=./fplib.so <program>
And now your program should finally dump core at the point in the global construction sequence where NaN is computed.
This approach has the nice side-effect of enabling you to "instrument" unsuspecting programs without recompiling them s.t.
they run with a reconfigured FPU (to have them crash if they compute NaNs, unless of course they explicitly configure the FPU
themselves instead of relying on what they get from the system.) But there are niftier applications of dynamic preloading, such
as valgrind on Linux and .NET on Windows (BTW, I don't know how to trick Windows into preloading, just that you can.) What I
wanted to illustrate wasn't how great preloading is, but the extent to which C++, the language forcing you to sink that low just
to execute something at the beginning of your program, SUCKS.
Barf.
Corrections - thanks to the respective commenters for these:
1. Section 3.6.2/1 of the ISO C++ standard states, that “dynamically initialized [objects] shall be initialized in the order
in which their definition appears in the translation unit”. So at least you have that out of your way if you want to deal with
the problem at the source code level.
2. Instead of hard-coding the path to libc.so, you can pass RTLD_NEXT to dlsym.
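To make correction 2 concrete, here is a sketch of the wrapper using RTLD_NEXT instead of dlopen with a hard-coded path. It is written as C++ (hence the extern "C"), it assumes glibc, and `fpu_setup` is replaced by a stand-in that only counts its calls so that the sketch is self-contained; the typedef names are made up.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT is a GNU extension (g++ usually defines this already)
#endif
#include <cassert>
#include <cstdlib>
#include <dlfcn.h>

// Stand-in for the article's fpu_setup(); it only counts its calls.
int g_fpu_setup_calls = 0;
static void fpu_setup() { ++g_fpu_setup_calls; }

typedef int (*main_fn)(int, char**, char**);
typedef int (*start_main_fn)(main_fn, int, char**,
                             void (*)(void), void (*)(void), void (*)(void),
                             void*);

extern "C" int __libc_start_main(main_fn main_ptr, int argc, char** ubp_av,
                                 void (*init)(void), void (*fini)(void),
                                 void (*rtld_fini)(void), void* stack_end)
{
    fpu_setup();
    // RTLD_NEXT: find the next definition of the symbol in load order after
    // the object containing this code, which should be libc's own.
    start_main_fn real_start =
        reinterpret_cast<start_main_fn>(dlsym(RTLD_NEXT, "__libc_start_main"));
    if (!real_start) std::abort();
    return real_start(main_ptr, argc, ubp_av, init, fini, rtld_fini, stack_end);
}
```

Saved as, say, fplib.cpp, something like `g++ -shared -fPIC -o fplib.so fplib.cpp -ldl` should build it for use with LD_PRELOAD as before.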
Good stuff. I'd be interested to know how to do this on Windows; I'm
going to research that at some point, and maybe I'll post about it on my
blog if and when I succeed.
I'm currently doing battle with a static library that uses global
objects with non-trivial constructors that allocate memory, which
doesn't interact too nicely with our memory manager. We're currently
using lazy initialization of the memory manager, but shutting it down
and checking for leaks on exit is problematic – not even using atexit()
works, since sometimes some global objects manage to register their
destructors to run after the memory manager shuts down and *BOOM*.
The man himself wrote a whole section about the problem of global
object initialization order in "The Design and Evolution of C++" but
doesn't really offer a solution.
I realize you have to continue serving your function as a high level
C++ critic but your post really doesn't have much to do with C++,
though, does it? What you want is better control over the linker/loader
which is really an OS thing. I'm not even sure if you can cover all
cases because some other process might have already loaded some of the
dynamic libraries.
You might want to look into the GNU linker's --wrap option. It looks
like it does exactly what you want, though I don't know that it will
work in a low level function like you are looking for. If it *does* work
then you at least can launch without the LD_PRELOAD env var stuff.
"info ld invocation options" to bring up the info page then search
for "wrap".
Dynamic loading is not broken. It has well defined semantics. Read up
on dlopen() and dlsym() in the Single Unix Specification v2 or v3. For
example, there is no need to dlopen() libc explicitly in your code, you
should just use dlsym() with RTLD_NEXT as handle.
@queisser: "I realize you have to continue serving your function as a
high level C++ critic but your post really doesn’t have much to do with
C++, though, does it? What you want is better control over the
linker/loader which is really an OS thing."
No, C++'s ability to execute static constructors before main() really
does suck, and IMO really does exacerbate the problem of finding the
"real entry point" of a non-trivial program. It's true that even C
programs execute a lot of code before main() — code that initializes the
heap, sets up stdin and stdout, and whatnot — but that code is generally
written by trusted sources(TM) and can basically be ignored when
debugging. C++ allows Joe Random Programmer to insert code before
main(), which is far, far worse.
Now, I'd argue that the answer is Don't Do That. If yosefk had had
the foresight and (probably more critically) authority to enforce a
project-wide coding standard that forbids the definition of any static
object with a constructor or destructor, then he wouldn't have had to
hack around the problem this way. (Hindsight is 20/20, yeah.)
Nice article. One thing to consider trying is rather than hardcoding
the path to libc, you can use the pseudo-handle RTLD_NEXT in your call
to dlsym().
.NET executables have a traditional entrypoint that jumps to
mscoree.dll!_CorExeMain which kicks off managed code execution, so I
don't think this counts as magic of the same order as LD_PRELOAD.
Regarding global constructors + custom memory manager: yeah, been
there, too. Since it was on an embedded target, I ended up adding the
memory manager initialization as another hack to the already hacked libc
startup code.
Regarding RTLD_NEXT – thanks for the tip. Regarding "dynamic loading
not being broken" – "defined behavior" isn't the opposite of "broken
behavior". When I say "broken", I mean (1) that it's not "The Right
Thing" (and there would be some hubris here if we weren't discussing
something as trivial as detecting redefinition, where The Right Thing is
damn easy to define), and (2) the fact that actual compilers out there
generating shared objects produce output compatible with the spec of
shared objects doesn't make that output compatible with the spec of
their source language (try throwing a C++ exception from one .so file
and catch it in a caller function located in another .so file and you'll
get the idea.)
Regarding this not being a C++ issue – as mentioned above, it is. Do
you have a problem doing something "at the [real] beginning" of your C
or Lisp or Python program?
Regarding my presumed duty to criticize C++ – um, how do I put this.
I get to use the fucking shit a lot. When I no longer do, people will
have to get C++ hate elsewhere.
FYI
> assuming that global objects defined in the same
> translation unit will be constructed one after another
> (AFAIK in practice they will, although in theory the
> implementation could, for example, order the
> constructor calls by the object name, so they
> wouldn’t)
Section 3.6.2/1 of the ISO C++ standard states, that "dynamically
initialized [objects] shall be initialized in the order in which their
definition appears in the translation unit"
Thanks!
I should update the article to quote this.
I don't know about Linux, but dyld on Mac OS X lets you declare a
function with the "constructor" attribute, i.e.
void do_something(void) __attribute__((constructor));
It'll get called by the dynamic linker before entering main(). But
I'm not sure how it gets called in relation to C++'s static
constructors. They get called after your program's image has been
loaded, since you can initialize globals in said constructors.
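A minimal sketch of the attribute in question, assuming GCC or Clang (both accept it on Linux as well). The names are made up. The optional priority argument is also a GCC extension; GCC documents that lower numbers run first and that the numbering is shared with init_priority for C++ objects, so on such toolchains `early_setup` would be expected to run before the constructor of `g_object`, but that ordering is an assumption worth checking on your own system. The code records the order without relying on it.

```cpp
#include <cassert>

static int g_step = 0;            // constant-initialized, so valid before any code runs
static int g_attr_ctor_step = 0;  // step at which the attribute constructor ran
static int g_object_step = 0;     // step at which the C++ global's constructor ran

// GCC/Clang extension: run this function during pre-main initialization.
// Priorities up to 100 are reserved for the implementation, so 101 is the
// earliest slot available to user code.
__attribute__((constructor(101))) static void early_setup() {
    g_attr_ctor_step = ++g_step;
}

struct Global {
    Global() { g_object_step = ++g_step; }
};
Global g_object;  // an ordinary C++ global with a dynamic initializer
```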
Would it have been possible for you to do the FPU tweaks and then
re-exec(2)? If the FPU control word is set on a per-address space basis,
that should do the trick.
@Damien: interesting, I didn't think about either. I'd guess
__attribute__((constructor)) simply adds the address of the function to
the .init section, so it gets called at some undefined point during the
pre-main initialization sequence. Regarding exec(2) – I don't know
whether the FPU mask is supposed to survive that, but it's a pretty
violent measure – for example, if someone prints before main(), the text
will be printed twice, etc. That is, it's basically OK to do this only
if you assume that the program's init sequence is "tame" enough – in
which case you wouldn't have the trouble of taming it in the first
place, or something.
Why not add a floating point operation to the end of fpu_setup, like
x=1.0+2.0; (with appropriate un-optimization settings) Then at least if
you have a pending exception you will get your core dump when fpu_setup
is called, not at some unknown later time when another function tries to
do f.p.?
You'd still have to debug your constructors without benefit of core
dumps on NaNs, but seems like fair trade for not having to trick the
loader into doing something it doesn't want to do, and how many
constructors need to do f.p., anyway?
@Matt: It's a good idea to amend fpu_setup with an fp operation to
save the head scratching when the program fails upon its first attempt
to use fp elsewhere. However, I vigorously reject the claim that
debugging global constructors is reasonably easy without core dumps at
the point of failure :) Seriously, a 1/5M-1M LOC program uses what,
500-2000 translation units? Each can instantiate globals, which can have
constructors calling constructors ad nauseam, and this shit can depend
on getenv or files or the command line (accessible via stuff like the
/proc/ file system.) How am I going to shovel through all that, and
where do I even start?
Of course I don't recommend to use the LD_PRELOAD shite in production
environments, only for debugging.
Instead of doing it dynamically, you can do it statically. Compile
your fpu_setup in a separate .o adding
asm(".section .init\ncall fpu_setup");
to it, then make sure you pass this as the first thing to the linker.
Nothing is guaranteed by the language, of course, but the C run-time
it's built on is pretty reliable.
The dynamic linking semantics of ELF are horrible indeed; Windows is much
better, though. It does not silently pick up the first definition – each
dll is in its own namespace and you explicitly specify from which dll
you want your symbol.
And btw, inserting static constructor into each .cpp file (that
includes some header file) is the classic trick used to initialize
iostream library "before anything else", as you're allowed to use it
from any static constructor.
This relies on the standard initialization order in one translation unit
and absence of "optimizations". C++ compilers are not allowed to
optimize away any constructors, static or not, as hell knows what it can
do inside. The only thing it can do is "inline and dissolve".
I've had the pleasure to witness the iostream initialization trick
under the unfortunate circumstances of running on butt-slow targets (RTL
simulators.) Fun.
However, I distinctly remember gcc "optimizing" away global
constructors – which meant I had to give up on "automatic registration"
(where you have a global object that adds itself to some map before main
to register a library with a framework – you know the drill.) That only
worked if the global was defined in a .cpp whose .o was passed directly
to the linker; archiving the .o into a .a caused it to be optimized
away. (Now that I think of it, perhaps "touching" the global in the
library code would help.) This seemed completely broken exactly because
static constructors can have side effects, which in this case they
actually did, so it seemed like a broken build, but I double-checked and
found no way around this.
If it works in .o, but not in .a it obviously has nothing to do with
gcc optimizing it. gcc has created the code, you can see it in .o, so
it's done its job. It does not know or care what you're doing with
.o.
This is a standard static linker behaviour. Were you putting static
constructor in its own .o inside .a? This won't work, because linker
won't pick .o from .a, unless something it already picked needs some
symbol, defined in that .o. This has nothing to do with what constructor
does or whether it's "used" at a language level (e.g. you can use it
from the same .o, it won't help).
Nothing was ever updated for C++ in this scheme, and in fact it's not
even clear how it should work, because .a has semantics of a bunch of
independent .o files that can be picked at will, not an all-or-nothing
module. As is typical for C++, no one cares; you're supposed to figure
this out yourself.
iostream trick works because it only needs to initialize itself if
something uses it. As it inserted itself into all .o files, if any of them
gets picked the code will run. If not, this means no one's going to use
iostream in this program. If you have a header file that's included into
every .cpp (that uses floating point operations) in your program, and
you're willing to recompile the whole thing, you could add fpu_setup in
static constructor in that header. Same thing.
On the original subject. In presence of dynamic libraries _start is
no longer the entry point. Now it's buried inside dynamic linker. You
can exploit this with preloading much simpler. When ld.so loads dynamic
library it calls library's _init function. You can either try using this
directly or just write a static constructor. That is, static constructor
in preloaded library runs before anything in your program.
And while at it, valgrind does not use preloading (it wouldn't be
able to do half of what it's doing), it uses dynamic binary
compilation.
By "gcc optimized it away", I didn't mean "gcc the compiler, as
opposed to the assembler and the linker, did it in an optimization pass
as opposed to the linkage phase", I meant "the GNU C++ implementation" –
the whole toolchain which is supposed to implement C++, which in this
particular case it doesn't.
valgrind uses preloading in order to begin its dynamic binary
compilation.
Is it possible to write a stub program that calls fpu_setup() and
then just exec(2)s the main program? That seems easier, and since it's
the same PID to the OS it seems like the FPU settings should
survive.
Perhaps the FPU CSR would survive exec() – didn't check it; actually
the asm(".section .init\ncall fpu_setup") trick mentioned in a comment
above is probably the easiest way to do it: that way, you simply have
the FPU configured before application code is called, as things should be,
without having to tweak the way someone runs the code.
The POSIX exec() spec sez:
"The state of the floating-point environment in the new process image
or in the initial thread of the new process image shall be set to the
default."
http://www.opengroup.org/onlinepubs/000095399/functions/exec.html
Come up with non-buggy code before complaining about C++ – your
function should've reset the fpu state before setting the flags.
Actually what you should have done is... wait for it... not do
floating point calculations in global constructors. I don't know why
this trips up so many people. Global initialization order is undefined
like you said, so why are you doing anything in global constructors in
the first place? Make them global pointers and new them at the beginning
of main, problem solved.
@Zach: it wasn't my code that ran before main.
This is not a problem with C++, this is a problem of BAD code. Code
heavily depending on global objects is bad – period.
Even worse if there is no documentation of which global objects are
constructed – if there is, simply add breakpoints in those constructors,
and you will find your bug.
And if you really need global objects, don't use any global objects
with constructors – rather call an init function at the beginning of
main.
If you don't have control over main, have only one global object with a
constructor, which constructs the others in a defined order AND document
it.
I don't see why one would complain that bad code produces bad
problems... Just don't use such code, write good code instead / fix
it.
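The "one global object that constructs the others" suggestion above could be sketched as follows. `MemoryManager`, `Registry`, `Globals` and `init_log` are hypothetical names chosen for the illustration; the point is only that members of a single object are constructed in declaration order, whatever the linker does with the translation units.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records construction order. A function-local static, so it is ready
// whenever the first constructor needs it.
static std::vector<std::string>& init_log() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical subsystems that would otherwise be independent globals.
struct MemoryManager { MemoryManager() { init_log().push_back("MemoryManager"); } };
struct Registry      { Registry()      { init_log().push_back("Registry"); } };

// The only global with a constructor. Members are constructed in the order
// they are declared here.
struct Globals {
    MemoryManager memory;  // constructed first
    Registry registry;     // constructed second
};
Globals g_globals;
```

This only works for code you can restructure, of course; it does nothing about globals hiding inside someone else's library.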
You'd be surprised how much effort was expended by the C++ community
to make the order of initialization and destruction sequences
deterministic and "correct" to some approximation (in the lack of an
explicitly defined order as well as an inclination to think that such
order should exist at all). So it's not "just" a bad code problem, it's
a language-specific cultural problem interacting with a loose language
spec.
On the other hand, if you like C++ and the one of its subcultures
you're dealing with the most, then I'm sincerely happy for you.
I don't get why you need to wrap around __libc_start_main(); wouldn't it
be easier to just invoke
fpu_setup() in the pre-loaded shared lib?
E.g.
fplib.c:
int foo()
{
    fpu_setup();
    return 0;
}
int bar = foo();
(will need to build with g++ instead of gcc to allow non-const
initializer element for initializing bar.)
It could work, I guess; I don't know enough about how loading works to
tell if bar=foo() is the first thing that will happen, but perhaps it will
be.
There is an elegant solution to this, I think I read it in
Alexandrescu's Modern C++ Design.
You need a .h file that's included from all the .cpp files (with
VisualC++ you typically have this anyway, because of using precompiled
headers).
Make a class which increments a static variable in its constructor. When
this is first incremented, do the initialization you need. Make a static
global variable of this class at the beginning of the header file. (This
will result in each translation unit having a copy, but your
initialization code will only run once and as the first thing, because
within translation units the order of initialization is the order of
declaration.)
Cool stuff at your page there.
I know this method; I wouldn't call it "elegant", but to each his
own. One "objective" issue with it though is, it doesn't work if you use
a library with global constructors and you have its header files and
object code but no source code. Whereas in C (or Python or a whole lot
of other languages) you simply have your first executing statement, the
second etc. etc. even if you can't recompile all the code that you're
using from source.