Imagine a restaurant. A professional chef, in charge of the best pizzas in town, and nobody knows how’re they done. It’s such a secret, those magic ingredients. And imagine a bunch of apprentices, supposed to help him. Imagine them following the chef’s orders. And then see, the chef going to the fridge to take some fresh cheese. The apprentice isn’t sure if the next pizza will need tomato sauce or if it will be a white pizza, so in the meantime, he takes the magic sauce from the drawer and prepares it in the table. Chef is back and says “no, no, this pizza is white, I even brought our secret cheese for it!”. The apprentice then saves the tomato sauce back to the drawer, and continue impassive. But a drop of the secret sauce was spilled in the table.
And a spy hired by the competition is there just in time to take that drop, vessel in, back to the lab for examination.
A modern concern
This is, essentially and over-simplified, what happens when you mix branch prediction in speculative execution, with the L3 cache being global to all cores. Your L3 cache is that big table, your chef is the program control, your assistants are the cores, including that one spy, who has access to the same table all the others have, and your assistants go on take the tomato sauce out of habit, even before knowing if they’ll need it or not.
You might have heard it or not. Last January, what is said to be the worst CPU bugs in history 1 was discovered, independently, from several research team, mainly Google’s Project Zero and E.U. funded projects in Europe. Reported secretly to Intel, The Linux Foundation, and Microsoft, and published when some fixes were ready to deployment 2.
The issues, fantastically described in a dedicated website, affects virtually every single Intel processor alive today, and a vast amount of AMD and ARM processors as well. And the fixes are, to say the least, expensive performance-wise.
For decades, the chip industry has been focused on performance. Nothing to blame, it’s precisely and nothing else than what the market asks them to. But some of the tricks that have made them fast, are precisely the ones that have made them unsafe all of a sudden. Some people have gone as far as saying that these performance improvements are merely a quest to make C faster, and not to make chips faster. And for the most part, regaining that safety comes at the price of giving up a lot of their gained performance.
But what are these vulnerabilities all about? Let’s analyse the cases.
Last time we saw that, to enhance the memory bottleneck, architectures build a layered memory model. While the main memory is living on its own piece of hardware, integrated with the chip there is a cache memory bucket that serves the fast accesses. As we also saw, there’re three layers: L1 and L2 are private to each core 3, while L3 is shared across all cores and threads. Every time a core attempts to fetch from memory, it checks each level, and if in the end it needed to go down all the way to main memory, it copies the found data all the way up, exploiting the locality we saw last time.
This has an important hurdle 4: a second core can measure the time it takes him to read a given address from memory by flushing the cache and reading it repeatedly; if the access was too fast, it means that some other core has already accessed this info, thereby leaking information of what other cores are doing.
Enter the Address Space
Because Memory is one giant pool where everything 5 lives together, Memory needs be virtualised to keep processes away from messing each other. And because every program has no idea where the Operating System will load him, programs are designed to believe that they’ll always be loaded at address zero and have the entire memory space for them. So in reality, memory is virtualised: the OS keeps a mapping of addresses for each process, between what the process believes to have, and what only the OS knows about their actual location. In this virtual address space, for performance reasons, the OS maps its own kernel at the end of it
Here it is important to know what a context switch is: in any standard OS these days, the processor is continuously changing its context, doing a bit of every process every few milliseconds, therefore achieving progress everywhere and avoiding one process from locking the processor. But changing from one process to another is not that easy: before doing so, you have to hide away from the second process all the information of the first, and as well keep this information safe so the first process can continue exactly where it was. This is done by the kernel: a context switch means the following transition: “Process A -> kernel switching mechanism -> process B -> kernel …”.
Besides, often a process needs services from the kernel, like message passing between other processes, or memory accesses, or hard drive reads… For this reason, the kernel always maps itself within the private address space of every process, so servicing requests or switching contexts, which are already expensive, can be made a bit cheaper by saving memory accesses and far jump instructions.
And this kernel memory is of course protected, so only the kernel can read itself.
So now we know everything we need to know to understand the problems.
In the official paper, we can find this revealing “Toy Example”:
The problem is then crystal-clear: considering out-of-order execution, the second instruction does indeed get executed, even if, semantically, was unreachable. The raised exception gives control to the kernel, which will then terminate the program, and out-of-order execution ensures that operations that shouldn’t have been executed won’t be committed, and no architectural signs will be seen; however, this has a micro-architectural effect: the fetching from memory gets loaded in cache, which is not flushed due to performance reasons. Furthermore, the access, if to kernel memory, was illegal and will raise an Access Violation Exception, but this is not triggered immediately, because it has to still go through the pipeline, long behind the actual violation, because a core cannot waist time waiting for exceptions or whatnot.
From here on, cache is a global unprotected memory for all cores, where, by just timing memory access, another core can see the loaded address, and by some tricky pointer arithmetic, it can see its contents. Repeating the process, it can dump the entire kernel, without any kernel privileges. And if you have access to the kernel, you have access to the entire memory map. Yes, including your browser’s tab checking your bank accounts, and all your saved passwords.
Immediately after acknowledgement, all major OSes patched their kernels. Basically, Meltdown can make any process read its entire address space, including the protected part. Therefore, we shouldn’t map the entire kernel in the process“ space, right? But there are necessary pieces that need be mapping, like “what is needed to enter/exit the kernel such as the entry/exit functions, interrupt descriptors (IDT) and the kernel trampoline stacks. This minimal set of data can still reveal the kernel’s ASLR base address; but, this is all trusted, which makes it harder to exploit.” 7
Meltdown does not attack any software vulnerability, he just bypasses the hardware protection. Hence, any software patch will always leave some small vulnerable memory surface. But disabling out-of-order execution, or flushing the cache continuously, or serialising all permission checks before fetches, would all involve a significant overhead, sometimes reported as high as a 50% slowdown.
Spectre, according to the official site, is called as it is because “The name is based on the root cause, speculative execution. As it is not easy to fix, it will haunt us for quite some time.”
Consider the following code:
if (x < array1_size)
y = array2[array1[x] * 4096];
In a case like this, the pipeline has to decide whether to execute the array check, or not. We can easily imagine giving the pipeline successful inputs of x many times, called the training phase, and then, giving one incorrect and maliciously crafted x. The branch predictor will then push the fetch into the pipeline, and if we can arrange array1_size and array2 to be out of cache (by [ccie]clflush[/ccie]-ing it), and we can make array1[x] be cached beforehand, then ordering the fetch would be a lot faster than comparing the values, and therefore the pipeline wouldn’t be flushed in time to prevent the execution of the fetch.
And then we’re back to cache side-attacks territory: we’ve successfully violated memory accesses, and make the CPU serve into cache the value we want exactly when we asked him to, therefore making it too easy to leak it to our interests.
Spectre has many variants, which makes it all worse. We can poison branch prediction, but we can also poison indirect jumps, that is, a jump to an address that needs be calculated first, as the speculative execution engine remembers as well previous jumps and speculates where the next jump will be. There’s also the variant where, even if speculative execution does not modify the cache whatsoever, the state of the cache still affects the timing of speculatively executed code, revealing “metadata” of the state, or even variants that don’t involve the cache, or victim code without branches nor jumps (the interrupt services from the OS are still provoking jumps anyway).
JIT compilers, the web worker feature of HTML5, Virtual Machines, and old compiled code, are all vulnerable. And new Spectre variants are being discovered all the time 9.
Hoping to keep this post short, I’ll just enumerate some of the implemented fixes to this day: in software, pointers can be secretly xored to poison them, hence the attacker would read garbage, and bounds checking could be replaced by index masking. In hardware, “future processors can track whether data was fetched as a result of a speculative operation”, branch predictors could not only not be shared across cores, but also across threads within the same core, and at last, there’s the very well respected Google’s solution retpolines.
What do we do?
Next, we will explore some of the ideas that academia and research are currently exploring. In the meantime, yes, once again, xkcd has the best conclusion to the thought:
- Please enjoy Linus Torvalds“ opinion on the fixes :D
- L1 is actually modelling a Harvard Architecture, an interesting variation of the Von Neumann’s one.
- Not any news, Evict-Flush-Prime are known Cache Side-Channel Attacks methods since a long time ago.
- Everything, including your browser visiting your bank account!