The bright side of dark silicon
It's been a decade or so since the end of frequency scaling, and multicore has become ubiquitous, there being no other means
to increase a chip's performance.
Some multicore systems are symmetric – all cores are identical, so you can easily move work from one core to another. Others
are asymmetric – as in CPU cores and GPU cores, where it's harder to move work between different types of cores.
Which is better – symmetric or asymmetric multicore?
Why symmetric is better
Three main reasons that I see:
- Better load balancing
- Less work for everyone
- More redundancy
Better load balancing
Asymmetric multicore makes load balancing harder, because a GPU can't easily yank a job from a queue shared with a CPU and
run that job. That's because some of those jobs are simply impossible to run on a GPU. Others run so badly that it's not worth
the trouble.
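For contrast, here's roughly what the easy symmetric case looks like: one shared queue that identical cores pull chunks from, so a faster or less loaded core just pulls more chunks. A minimal sketch using pthreads and C11 atomics; process_pixel is a made-up stand-in for the real per-pixel work.

#include <pthread.h>
#include <stdatomic.h>

#define NUM_PIXELS (1 << 20)
#define CHUNK      4096
#define NUM_CORES  4

static atomic_int next_chunk;                 /* the whole "shared queue" */
static long results[NUM_PIXELS];

static void process_pixel(int i) { results[i] = i * 2; }  /* stand-in work */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int c = atomic_fetch_add(&next_chunk, 1);   /* grab the next chunk */
        int start = c * CHUNK;
        if (start >= NUM_PIXELS) break;             /* queue drained */
        int end = start + CHUNK;
        if (end > NUM_PIXELS) end = NUM_PIXELS;
        for (int i = start; i < end; ++i)
            process_pixel(i);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NUM_CORES];
    for (int i = 0; i < NUM_CORES; ++i) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_CORES; ++i) pthread_join(t[i], NULL);
    return 0;
}

With asymmetric cores, even this trivial loop body would need two differently compiled, differently optimized versions before the cores could share the queue.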
And those CPU codes that could run OK on GPUs would have to be compiled twice – for the CPU and the GPU – and even then you
can't make things like function pointers and vtables work (though I can imagine a hardware workaround for the latter – a
translation table of sorts; maybe I should patent it. Anyway, we're very far from that being our biggest problem.)
And then you need a shared queue between the CPU and the GPU – how does that work? – or you partition the work statically
(each of the 4 CPUs processes 10% of the pixels, the remaining 60% of the pixels go to the GPU cores).
But static partitioning, often quite lousy even with symmetric multicore, is awful with asymmetric multicore because how do
you choose the percentages? You need to know the relative strength of the cores at each task. How do you do that – dynamically
figure out the first time your program runs on a new device?
So this is all close to insane. What people actually do instead is task parallelism – they look at their different jobs, and
they figure out which should run on each type of core, and optimize each task for the respective core.
But task parallelism never load-balances very well. Let's say you look for faces in an image on the GPU and then try to
figure out whose faces these are on the CPUs. Then sometimes the GPU finds a lot of faces and sometimes just a few, taking
roughly the same time to do so. But the CPU then has either a lot of work or just a little. So one of them will tend to be the
bottleneck.
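A back-of-the-envelope sketch of that bottleneck, with made-up numbers: detection time is roughly constant per frame, recognition time grows with the number of faces found, and whichever stage is slower on a given frame sets the pace while the other one idles.

#include <stdio.h>

int main(void) {
    double gpu_ms = 10.0;          /* detection: roughly constant per frame */
    double cpu_ms_per_face = 2.0;  /* recognition: grows with the face count */
    int faces[] = { 1, 3, 25, 2, 40 };
    for (int f = 0; f < 5; ++f) {
        double cpu_ms = faces[f] * cpu_ms_per_face;
        double slower = cpu_ms > gpu_ms ? cpu_ms : gpu_ms;
        printf("frame %d: GPU %5.1f ms, CPU %5.1f ms -> pipeline step %5.1f ms\n",
               f, faces[f] /* faces */ ? gpu_ms : gpu_ms, cpu_ms, slower);
    }
    return 0;
}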
Less work for everyone
We actually touched on that above. If you wanted to do data parallelism, running the same task on all your cores but on
different subsets of the data, one problem would be to optimize your code for each type of core. That's more work. Someone at
the OS/system level would also need to help you with sharing task queues and vtables – still more work.
Generally, more types of core means more hardware design, more compilers, assemblers, linkers and debuggers, more manuals,
and more integration work from bus protocols to program loaders, etc. etc. And, for programmers, not only more optimization work
but more portability problems.
More redundancy
That's a bit futuristic, but I actually heard this argument from respectable people. The idea is, chip manufacturing yields
will significantly drop at, say, 8nm processes. And then your chance to get a chip without a microscopic defect somewhere will
become so low that throwing away every defective chip will be uneconomical.
Well, with symmetric multicore you don't have to throw away the chip. If the testing equipment identifies the defective core and marks the chip accordingly using fuses or some such (which is easy to do), the OS can then run jobs on all cores but the bad one.
Nifty, isn't it?
With asymmetric multicore, you can't do that, because some type of work will have no core on which it can run.
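For concreteness, here's a minimal user-space approximation of the symmetric scheme, assuming Linux's sched_setaffinity; the fuse-derived mask of good cores and the sysfs path in the comment are made up.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical: bit i set means core i passed manufacturing test.
       A real system would read this back from the fuses via firmware,
       e.g. from something like /sys/firmware/good_cores (made-up path). */
    unsigned good_mask = 0xE;   /* core 0 fused off, cores 1..3 usable */

    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < (int)(8 * sizeof(good_mask)); ++cpu)
        if (good_mask & (1u << cpu))
            CPU_SET(cpu, &set);

    /* Restrict this process (and whatever it spawns) to the good cores. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
    return 0;
}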
Why asymmetric is inevitable
In two words – dark silicon.
"Dark silicon" is a buzzword used to describe the growing gap between how many transistors you can cram into a chip with each
advancement in lithography vs how many transistors you can actually use simultaneously given your power budget – the
gap between area gains and power gains.
It's been a couple of years since the "dark silicon"
paper which predicted "the end of multicore scaling" – a sad follow-up to the end of frequency scaling.
The idea is, you can have 2x more cores with each lithography shrink, but your energy efficiency grows only by a square root
of 2. So 4 shrinks mean 16x more cores – but within a fixed power budget, you can only actually use 4. So progress slows down,
so to speak. These numbers aren't very precise – you have to know your specific process to make a budget for your chip – but
they're actually not bad as a tool to think about this.
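The arithmetic in a nutshell, under those rough assumptions (core count doubles per shrink, power efficiency only improves by a factor of sqrt(2)):

#include <math.h>
#include <stdio.h>

int main(void) {
    for (int shrinks = 0; shrinks <= 4; ++shrinks) {
        double cores  = pow(2.0, shrinks);       /* how many cores fit in the area      */
        double usable = pow(sqrt(2.0), shrinks); /* how many the power budget can run   */
        printf("%d shrinks: %2.0fx cores, %2.0fx usable -> %3.0f%% of the silicon lit\n",
               shrinks, cores, usable, 100.0 * usable / cores);
    }
    return 0;
}

After 4 shrinks: 16x the cores, 4x usable, 25% of the silicon lit at any given moment.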
With 16x more area but just 4x more power, can anything be done to avoid having that other 4x untapped?
It appears that the only route is specialization – spend a large fraction of the area on specialized cores which are much
faster at some useful tasks than the other cores you have.
Can you then use them all in parallel? No – symmetric or asymmetric, keeping all cores busy is outside your power budget.
But, if much of the runtime is spent running code on specialized cores doing the job N times faster than the next best core,
then you'll have regained much of your 4x – or even gained more than 4x.
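One way to estimate how much you get back is the usual Amdahl-style formula: if a fraction f of the runtime moves to a specialized core that is N times faster at it, the overall speedup is 1 / ((1 - f) + f / N). A tiny sketch with made-up numbers:

#include <stdio.h>

static double speedup(double f, double N) {
    return 1.0 / ((1.0 - f) + f / N);
}

int main(void) {
    printf("%.2fx\n", speedup(0.5, 10.0));  /* half the work, 10x faster: 1.82x     */
    printf("%.2fx\n", speedup(0.8, 20.0));  /* 80% of the work, 20x faster: 4.17x   */
    printf("%.2fx\n", speedup(0.9, 50.0));  /* 90% of the work, 50x faster: 8.47x   */
    return 0;
}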
Gaining more than 4x has always been possible with specialized cores, of course; dark silicon is just a compelling reason to
do it, because it robs you of the much easier alternative.
What about load balancing? Oh, aren't we "lucky"! It's OK that things don't load-balance very well on these
asymmetric systems – because if they did, all cores would be busy all the time. And we can't afford that – we must keep some of
the silicon "dark" (not working) anyway!
And what about redundancy? I dunno – if the yield problem materializes, the increasingly asymmetric designs of today are in trouble. Or are they? If you have 4 CPUs and 4 GPU clusters, losing one core costs you 25% of that kind of performance, worse than losing 1 of 12 CPUs; but the asymmetric system outperforms the symmetric one by more than 25%, or so we hope.
So the bright side of dark silicon is that it forces us to develop new core architectures – because to fully
reap the benefits of lithography shrinks, we can't just cram more of the same cores into a same-sized chip. Which, BTW, has been
getting boring, boring, boring for a long time. CPU architecture has stabilized to a rather great extent; accelerator
architecture, not nearly so.
GPUs are the tip of the iceberg, really – the most widely known and easily accessible accelerator, but there are loads of
them coming in endless shapes and colors. And as time goes by and as long as transistors keep shrinking but their power
efficiency lags behind, we'll need more and more kinds of accelerators.
(I have a lot of fun working on accelerator architecture, in part due to the above-mentioned factors, and I can't help
wondering why it appears to be a rather marginal part of "computer architecture" which largely focuses on CPUs; I think it has
to do with CPUs being a much better topic for quantitative research, but that's a subject for a separate discussion.)
And this is why the CPU will likely occupy an increasingly small share of the chip area, continuing the trend that you can
see in chip photos from ChipWorks et al.
P.S.
I work on switching-limited chip designs: most of the energy is spent on switching transistors. So you don't have to power
down the cores between tasks – you can keep them in an idle state and they'll consume almost no energy, because there's no
switching – zeros stay zeros, and ones stay ones.
Chips which run at higher frequencies and which are not designed to operate at high temperatures (where leakage would become intolerably high – leakage grows non-linearly with temperature) are often leakage-limited. This means that you must actually power down a core, or else it keeps consuming much of the energy it consumes when doing work.
Sometimes powering down is natural, as in standby mode. Powering down midway through realtime processing is harder though,
because it takes time to power things down and then to power them back up and reinitialize their pesky little bits such as cache
line tags, etc.
So in a leakage-limited design, asymmetric multicore is at some point no better than symmetric multicore – if the gaps
between your tasks are sufficiently short, you can't power down anything, and then your silicon is never dark, so either you
make smaller chips or programs burn them.
But powering up and down isn't that slow, so a lot of workloads should be far from this sad point.
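A rough way to see where that point is (all numbers are hypothetical): gating a core only pays off when the leakage energy saved over the idle gap exceeds the energy spent on the power-down/power-up/reinit dance.

#include <stdio.h>

int main(void) {
    double leak_watts    = 0.5;     /* leakage of an idle but powered core (made up)     */
    double cycle_joules  = 0.002;   /* power down + power up + refill state (made up)    */
    double breakeven_sec = cycle_joules / leak_watts;
    printf("idle gaps shorter than %.1f ms aren't worth gating\n", breakeven_sec * 1e3);
    return 0;
}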
P.P.S.
I know about GreenDroid, a project by people who make the "dark silicon leads to specialization" argument quite eloquently; I
don't think their specialization is the right kind – I think cores should be programmable – but that again is a subject for a
separate discussion.
P.P.P.S.
Of course there's one thing you can always do with extra area which is conceptually much easier than adding new types of
cores – namely, add more memory, typically L2/L3 cache. Memory is a perfect fit for the dark silicon age, because it essentially
is dark silicon – its switching energy consumption is roughly proportionate to the number of bytes you access
per cycle but is largely independent of the number of bytes you keep in there. And as to leakage, it's easier to
minimize for memories than for most other kinds of things.
Another "lucky" coincidence is that you really need caches these days because external DRAM response latency has been 100 ns
for a long time while processor clocks tend to 50-200x shorter, so missing all the caches really hurts.
So it's natural to expect memories to grow first and then the accelerator zoo; again, consistent with recent chip photos where, say, ARM's caches are considerably bigger than the ARM cores themselves.
(Itanium famously spent 85% of the chip area or so on caches, but that was more of "cheating" – a way to show off performance relative to x86 when in fact the advantage wasn't there – than anything else; at least that's how Bob Colwell quoted his conversation with Andy Grove. These days, however, it has become one of the few ways to actually use the extra area.)
> And those CPU codes that could run OK on GPUs would have to be
compiled twice – for the CPU and the GPU – and even then you can't make
things like function pointers and vtables work (though I can imagine a
hardware workaround for the latter – a translation table of sorts; maybe
I should patent it. Anyway, we're very far from that being our biggest
problem.)
I think AMD has beaten you to the punch: they now use a unified address space on their latest GPUs. I believe the GPU uses the CPU's MMU, and any page faults are handled by the CPU.
A unified address space isn't enough – you still have different code
pointers if your code is compiled twice, so if you pass a function
pointer from a CPU to a GPU, it won't be able to call the function
without translation.
Is it feasible to use reversible computation in order to extend the
usefulness of switching-limited designs? Or does it have further
problems that make it not viable?
You mean http://en.wikipedia.org/wiki/Reversible_computing?
I'm not sure how it would help or what to do with it in general (the
simplest logic gates are logically irreversible).
The tradeoffs between some of the problems you indicate, such as load balancing and task parallelism, may be offset by the benefits of specialized silicon, though, yes? Two symmetric CPUs in perfect load
balancing still won't process certain tasks as fast as a CPU/GPU with
the GPU doing 100% of the work, so there's still a net gain for that
task asymmetrically without expending work on parallelization
(admittedly pushing some hard work to the chip designers). As long as
real world tasks benefit from specialized silicon, though, I think
asymmetry is to be expected.
There are parallels here with a debate in the early 90's about how
best to support dynamic and object-oriented programming on the chips of
the day. Do we add new instructions to accelerate method dispatch, or
just add cache? The answer seems so obvious now I'm almost embarrassed
to admit the time I spent working with a chip designer on the former
approach (I was at Apple when it first got involved in the ARM). Of
course larger caches improved things dramatically and the ARM
instruction set was not extended in this way, but at the time we were
not so far removed from Lisp machines and so we had a bias toward custom
hardware.
Now we worry more about processing large data sets and less about
method dispatch (except that dynamic language performance has at last
become a hot topic). Again more cache helps, but specialization allows
better task parallelism (as you point out) so the answer will certainly
involve a mix of the two.
@Chipmonkey: it's true that you often gain from asymmetry due to
specialization more than you lose from it due to poor load balancing;
poor load balancing just makes the choice between symmetric and
asymmetric harder – if you can do symmetry, that is. Dark silicon makes
things easier because you simply can't do perfectly load balanced
symmetric systems, so the tradeoff is gone.
@Jim: you worked on OO extensions to ARM at Apple?! Cool stuff! I always thought it wouldn't help much, simply because you still need to do the same number of memory indirections, so the number of times you go through memory won't go down, and that's your bottleneck to begin with; but maybe that's simplistic, and it'd be really interesting to hear details of how it was supposed to work (legally disclosing this kind of thing is of course often impossible, but it doesn't hurt to ask...)
Is dark silicon compatible with Microsoft Visual Basic? I realise
that you can probably program the dark silicon with advanced languages
like C and Jabascript, but for me it is important that it can run Visual
Basic.
Thank you.
For information on Microsoft Visual Basic, please contact
Microsoft.
@Fenton: Dark silicon is invisible to the naked eye, so it can't be
programmed with so-called "visual" languages, like Visual Basic.
However, blind developers can often program it with a stick.
"Memory is a perfect fit for the dark silicon age, because it
essentially is dark silicon – its switching energy consumption is
roughly proportionate to the number of bytes you access per cycle but is
largely independent of the number of bytes you keep in there."
Isn't power consumption proportional to the distance which information needs to travel? Besides, looking at power-optimized designs (e.g. NVIDIA Kepler), cache sizes weren't increased all that much, especially compared to the increase in the size of the whole chip (e.g. 2x for NVIDIA Kepler).
The way it works with memories is, roughly, yes, distance is
important and you can't have a "deep" monolithic memory because of that.
So if you need to fetch 32b per cycle, and you want 4M of memory, you
have a large bunch of smaller banks of memory, and you use some of the
address bits to select the bank, and then fetch your bytes from that
bank. The reason your power doesn't grow much is that most of these banks are deactivated. One of them might produce data that then must travel a long distance, and that costs; but it's offset by not activating all the other stuff on that cycle. Basically memory is cheap power-wise because,
well, it doesn't compute anything, it just remembers.
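A minimal sketch of the addressing scheme, with made-up sizes: a 4MB memory built from 64 banks of 64KB, where part of the address selects the bank and only that bank is activated on a given cycle.

#include <stdint.h>
#include <stdio.h>

#define WORD_BYTES 4
#define BANK_WORDS (64 * 1024 / WORD_BYTES)    /* 64KB banks                  */
#define NUM_BANKS  64                          /* 64 * 64KB = 4MB in total    */

static uint32_t banks[NUM_BANKS][BANK_WORDS];  /* stand-in for the SRAM banks */

static uint32_t read_word(uint32_t addr) {
    uint32_t word = (addr / WORD_BYTES) % BANK_WORDS;             /* offset inside the bank */
    uint32_t bank = (addr / WORD_BYTES) / BANK_WORDS % NUM_BANKS; /* which bank to activate */
    /* Only banks[bank] is activated this cycle; the other 63 stay dark. */
    return banks[bank][word];
}

int main(void) {
    printf("%u\n", read_word(0x123456));
    return 0;
}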
As to what NVIDIA did in a specific chip generation – I'd need to look deeper to guess their rationale; all I can do here is state a few general trends which should hold over time across a large set of designs.
Interesting article, especially combined with the next one on
FPGAs.
Any thought as to whether FPGAs could be made part of the dark
silicon balance, one that could tightly integrate with the CPU and GPU
cores via signal lines between them and the FPGA switching fabric, and
perhaps direct accessibility of CPU/GPU registers from the FPGA?
Also, you mentioned that asynchronous is the way to go, but a talk
from a few years ago http://www.youtube.com/watch?v=KfgWmQpzD74 suggested
that both asynchronous and dynamic cores were useful. The dynamic cores
would be regular CPUs run at different frequencies, say 200MHz, 800MHz,
and 2GHz, and you would orchestrate turning these on and off to match
your power budget. They couldn't all be turned on at once, but provided
you had some idea of the relative cost of different calculations and the
dependencies between them you could schedule the optimal mix of cores
for your load and assign them appropriately. This couldn't be done
perfectly, of course, but a Pareto result could be good enough.
It has been interesting to me that the schedulers across cores you
now see in erlang VMs (for erlang processes) and golang (for goroutines)
could well be the way to realize that orchestration some day.
@cmccabe
That's not a nice thing to say. It should be possible to program dark
silicon in braille.
I think dynamic frequency helps with energy (saving it when you can)
but not power (energy spending rate); that is, it helps spend less when
you aren't under heavy load, but if you're continuously under heavy
load, and therefore dissipating a lot of power without being able to
slow anything down, you need asymmetry or you'll burn.
Direct accessibility of registers from the FPGA – I don't think so;
integration on the same die – I hope to post soon how I see it, I think
it's not going to be in mass produced chips any time soon though it'd be
nice if it were.
There's a separate issue from performance on the horizon, and that's
availability. If you assign a task to a processor and that task becomes
idle, it's certainly a lot faster to start up than if the task is paged
out to memory. I think that "suspending" cores when a task indicates
that it is idle may be a way of utilizing the extra cores without
requiring that they be executing code continuously. This leads to a
processor model with very many less-capable cores.
I guess there's value to it in scenarios with a ton of concurrency
where you constantly context-switch (network processors tend to be big
clusters with many, many "wimpy" cores for that reason I believe.) On
end-user machines I'd be surprised if the cost of context switching
dominated to that large an extent.
Your "more redundancy" argument isn't so futuristic, it's quite the
case on the PS3. The Cell architecture has 8 SPUs, but the PS3's specs
require only 7 SPUs, thus the yield is increased.
Imho it was a nice try, but it was reverted to a more regular architecture on the PS4. So on one hand, we are encouraged to develop
new architectures, and on the other hand, it seems we are not embracing
new architectures...
Regarding dynamic frequency, I think it can help the power
consumption under heavy load, although it will take a while before we
have that kind of infrastructure.
Imagine a program like this:
A = f() // f() task runs within 20ms 90% of the time
B = g() // g() task runs within 100ms 90% of the time
if (B > 10) { // succeeds 75% of the time
  C = A + B
} else {
  C = B
}
B usually takes five times as long as A does, and C always depends on
B and usually on A. A task scheduler that kept JIT-like stats could
parallelize A on a 200MHz core and B on a 1GHz core, and have A's result
before B's most of the time, with less power used overall.
Also, perhaps the dynamic and asynchronous views aren't mutually
exclusive. A slower core could have a much simpler implementation in
silicon, meaning far lower power and much less space required. So as
transistor density goes up a package might have 64 cores at 200MHz, 16
at 800MHz, and only 4 at 2GHz. Assuming each core frequency level
represents 100% of the power budget, a highly parallelizable program
might get better performance scheduled across 32 200MHz, 4 800MHz, and 1
2GHz cores rather than running all 37 tasks across the 4 2GHz cores.
Anyway, it is all speculative for now.
"And those CPU codes that could run OK on GPUs would have to be
compiled twice — for the CPU and the GPU — and even then you can't make
things like function pointers and vtables work ..."
Surely if your code relies on making calls through function pointers
or vtables, it's not going to "run OK on GPUs" in the first place, is
it? I say this as a total noob in the world of GPGPU, but I thought
branches were bad, in which case "load this pointer and jump through it"
must be far worse.
Well, GPUs evolve, and then it's a question of how much divergence
you have; if everyone branches to the same place then branches are not
so bad, I believe.
I didn't really mean to focus on GPUs, I just chose them as a popular
example of a core which you can use but which is not another symmetric
CPU. I wrote more specifically about GPUs here: http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Very interesting article.
I used to integrate accelerators into custom SoC chips, and the software became really nasty. Not sure what kind of specialization can save the day; I'd like to see the result.
Also, Intel released their 22nm products to the market and it seems the dark silicon issue doesn't matter much, actually. I've talked with IBM and TSMC guys about the 14nm/16nm processes, and it seems leakage is still the biggest headache.
...but leakage is the problem leading to dark silicon!
That's why you can't lower the voltage, not? Which in turn is why you
can't get power efficiency improvements beyond ~L while your area
improvement is L^2.
You say Intel released 22nm products and you say dark silicon doesn't
matter. You mean they got Dennardian performance improvements – that is,
1.4x the frequency and 2x the core count relative to the previous
node?
If you look at newer nodes' stats, don't you see a gap between the
area improvements and the power improvements? Leakage is the root cause
but the gap is the result, not?
What I heard is they implemented a *very* aggressive power gating mechanism that powers down blocks whenever possible to save energy. Indeed lots of silicon turns "dark" frequently. But is it a problem?
Or is it a goal?
Well, it's both, right? As long as you can shut down things you
aren't using anyway you're achieving a goal; once you're forced to shut
down things you'd very much like to use – like half your cores whatever
those cores are – then you're facing a problem.
@ Brian Balke (#14),
Yes, availability is an interesting thought area, especially from a security perspective.
As is often said, the number of "bugs" in code is proportional to the lines of code written, not the computing power of individual lines. Thus the higher-level the language, the more productive the code, with fewer bugs, and the faster it will be written. Taken to a logical conclusion you end up with *nix-type shell scripting, where each line of code in effect calls an applet.
Thus whilst "code cutter" level programmers carve out lots of production code using scripting, the applets are written by security-aware, engineer-level programmers using an engineering approach.
You thus have hundreds of wimpy cores which in effect run applets. Script pipelining is done through IPC mechanisms working through "main memory", whilst applets use memory local to the core.
You have a number of specialised cores that act as hypervisors; these control the wimpy cores in a number of ways. One specific area is that each wimpy core uses the equivalent of a reduced-function MMU through which IPC happens. However, unlike a conventional setup, it is not the associated wimpy core that controls the MMU but the hypervisor.
In essence the wimpy core is "jailed" behind the MMU and the applet has no knowledge of the rest of the system. Further, the hypervisor can limit the amount of local memory an applet has to work with, and can also halt the wimpy core and inspect the local memory. Thus malware has little or no space to exist, and no knowledge of the system in general, nor does it have a sense of time to establish covert channels between wimpy cores.
Further, cores can be run in parallel, fed the same data and using the same algorithms but developed differently. Wimpy cores are then run in triplets and their outputs compared in a voting protocol. If the three outputs agree then it is unlikely there is a fault or malware in the three cores. If however there is a difference then there is either a fault or malware, which the hypervisor can look for. Importantly, malware would have to appear simultaneously on all three wimpy cores, and with a little thought you will see this is not possible for externally injected malware, so it will get flagged up.
Likewise, malware at this level cannot be introduced by the "code cutters", only by the "applet engineers", which can be limited if not removed by using separate teams to develop the three different implementations of each applet, and appropriate formal methods.
can you please let me know the techniques to reduce dark
silicon???