If executing one big, complex thread ever faster is risky, as we saw, because there’s a physical limit to sheer clock speed and cheating optimisations are unsafe, then we all know the other option: what about doing many small, simple things independently? To put it differently: if we cannot make the clock any faster, nor speed up an essentially sequential stream of instructions, why don’t we just put many actors, in many roles, to work together instead?
In future posts, I want to explore different models of computation apart from the Turing Machine, which bring with them different semantics and programming paradigms. But today, after going on at length reviewing our modern models of hardware, I want to set down some ideas about how different, perhaps futuristic architectures away from the Von Neumann model might look.
In Chisnall’s article on the C language from a few posts back, the researcher presents a few final ideas on how to imagine a non-C processor, together with a set of further reads and bibliography.
Amdahl’s Law
It has been said many times, by me and by everyone else: we’re going to make our computers faster by adding more cores. But no discussion of speed-up in parallel computation would be complete without reference to Amdahl’s Law [1]:
$latex {\displaystyle S={\frac {1}{(1-p)+{\frac {p}{n}}}}} &bg=fffef6$
where $latex S &bg=fffef6$ is the speedup that can be achieved when parallelising a fraction $latex p&bg=fffef6$ of a program across $latex n&bg=fffef6$ cores. It’s easy to see, then, that to achieve a higher speedup we need a lower denominator, which means we need a higher parallelisable proportion $latex p&bg=fffef6$ to reduce the first term, and to let the second term shrink as $latex n&bg=fffef6$ grows.
$latex {\displaystyle \lim_{n \to \infty} S={\frac {1}{1-p}}} &bg=fffef6$
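For instance, even if 95% of a program parallelises perfectly ($latex p=0.95&bg=fffef6$), infinitely many cores cap the speedup at
$latex {\displaystyle S={\frac {1}{1-0.95}}=20} &bg=fffef6$
so no amount of hardware buys more than a 20× improvement while that last 5% stays sequential.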
But the model fails by its simplicity: it doesn’t take into account the price of context switches, nor the contention for resources among threads. An extended version was presented by Eyerman and Eeckhout [2], including a contention probability $latex P_{ctn} &bg=fffef6$ and splitting the parallelisable fraction into two parts: $latex p_{ctn} &bg=fffef6$, which contends, and $latex p_{fr} &bg=fffef6$, which doesn’t, where $latex p=p_{ctn} + p_{fr}&bg=fffef6$ and the sequential part is $latex s=1-p&bg=fffef6$. In this new model, as the number of cores grows, the equation behaves as follows:
$latex {\displaystyle \lim_{n \to \infty} S={\frac {1}{s + p_{ctn} \cdot P_{ctn}}}} &bg=fffef6$
Hence we need not only to increase the parallelisable part $latex p&bg=fffef6$, but also to reduce the probability of contention $latex P_{ctn}&bg=fffef6$ and the contending fraction $latex p_{ctn}&bg=fffef6$ that pays for it. One way to achieve this is finer-grained abstractions: the shorter each sequential stream of instructions and the greater the number of streams, the more parallelised the program is, while contention is reduced.
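To put some illustrative (made-up, not measured) numbers on it: keep the same sequential part $latex s=0.05&bg=fffef6$, but suppose a fifth of the parallel work contends ($latex p_{ctn}=0.2&bg=fffef6$) with a contention probability of $latex P_{ctn}=0.5&bg=fffef6$. Then
$latex {\displaystyle \lim_{n \to \infty} S={\frac {1}{0.05 + 0.2 \cdot 0.5}} \approx 6.7} &bg=fffef6$
far below the 20× that plain Amdahl promised for the same sequential fraction.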
Fine-grained software
Software communication
Contention is essentially a by-product of shared resources in the wide sense: because a given resource is shared among more than one process, its usage needs to be arranged to ensure collaboration without mutual harm. Programs spread across threads will need access to resources that are ultimately limited, and to some form of intercommunication. Two major models are presented: the Shared Memory model and the Message Passing model.
Shared Memory
In this model, processes communicate across defined boundaries of shared memory. Two processes have access to the same memory frame, and mechanisms must be provided to ensure that data races don’t occur: locks, condition variables, and atomics. This is the model that almost every major language follows (C/C++, Java, JS, Python, everything), and it is what makes cache coherency so complicated: if a given shared memory line is cached by two cores and one of them modifies it, a protocol to keep all caches in sync needs to be designed. This model is easy to extend from a sequential Von Neumann concept, where a process can point, fetch and push to the whole memory pool, and its stream of execution is essentially sequential. A different abstraction than locking can be used, though: transactions, explained below.
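As a minimal sketch of the shared-memory style (written in Haskell, purely because its runtime keeps the later examples short, not because the model is tied to it), here two worker threads bump one shared counter through an MVar, the lock-like primitive that serialises access to the cell:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Monad (forM_, replicateM_)

main :: IO ()
main = do
  counter <- newMVar (0 :: Int)        -- shared, mutable, guarded by a lock
  done1   <- newEmptyMVar
  done2   <- newEmptyMVar
  forM_ [done1, done2] $ \done ->
    forkIO $ do
      replicateM_ 10000 $
        modifyMVar_ counter (pure . (+ 1))   -- take the lock, update, release
      putMVar done ()                        -- signal that this worker finished
  takeMVar done1                             -- wait for both workers
  takeMVar done2
  readMVar counter >>= print                 -- prints 20000
```

Without the MVar, the two read-modify-write sequences could interleave and lose updates, which is precisely the data race the locking machinery exists to prevent.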
Message passing
In contrast, in the message passing model processes are strictly isolated from one another, and communication can only happen through specified message-passing channels. By enforcing this, there’s no possibility of contention over shared state, though a process might be forced to wait for a message to arrive.
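And a sketch of the message-passing style in the same vein: the worker below owns no shared state at all and only ever talks to the main thread through two channels (the squaring job and the channel names are invented for the example):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Monad (forM_, replicateM_)

main :: IO ()
main = do
  requests <- newChan                      -- the only points of contact
  replies  <- newChan
  _ <- forkIO $ replicateM_ 5 $ do         -- worker: isolated, no shared memory
         n <- readChan requests            -- blocks until a message arrives
         writeChan replies (n * n)
  forM_ [1 .. 5 :: Int] (writeChan requests)
  replicateM_ 5 (readChan replies >>= print)
```

If a reply isn’t ready yet, the main thread simply blocks on readChan: that is the waiting mentioned above, traded for the absence of shared mutable state.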
Lightweight threads
In current implementations, threads are known to be expensive, carrying a lot of state along. If a program is to be divided into arbitrarily many threads, two problems arise: first, context switches must be cheap, and second, thread scheduling must be efficient. While OSes ought to provide lighter abstractions, today these are often provided by third-party runtimes at user level (the C++ runtime, the JVM, and whatnot), and much research is being done on cheaper context switches, like the Lazy Context Switch or cached Virtual Memory Maps.
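GHC’s runtime happens to be one concrete example of such a user-level abstraction: forkIO threads are green threads scheduled by the runtime rather than by the kernel, so a sketch like the following can spawn a hundred thousand of them without trouble (the count is arbitrary):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Monad (replicateM_)

main :: IO ()
main = do
  done <- newChan
  -- 100,000 lightweight threads: the runtime multiplexes them onto a handful
  -- of OS threads, so creation and switching stay far cheaper than for
  -- 100,000 kernel threads.
  replicateM_ 100000 $ forkIO (writeChan done ())
  replicateM_ 100000 (readChan done)       -- wait until every thread has reported
  putStrLn "all 100000 threads finished"
```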
STM [3]
Locking mechanisms are an effective and simple way to resolve resource contention problems, but their complexity doesn’t scale as the number of threads grows, especially when used in a fine-grained manner. Instead, Transactional Memory is an old lock-free technique coming from the database world, where accesses to shared regions are declared as transactions. A writing transaction executes optimistically (believing it’s the exclusive owner of the resource), but if it finds conflicts at runtime it can backtrack, leaving the transaction uncommitted, and execute again. The runtime responsible for this administration needs hardware support to be efficient [4].
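In the flavour of STM described in the School of Haskell article referenced below, a minimal sketch of the idea: the transfer between two accounts is declared as a single transaction, and the runtime re-executes it if another thread conflicts with it mid-flight (the accounts and amounts are made up):

```haskell
import Control.Concurrent.STM

-- The whole transfer is one transaction: other threads can never observe
-- a state where the money has left one account but not reached the other.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  writeTVar from (balance - amount)
  modifyTVar' to (+ amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 30)      -- commits only if no conflict was found
  balances <- (,) <$> readTVarIO a <*> readTVarIO b
  print balances                    -- (70,30)
```

No locks appear in the code; the optimistic execution, conflict detection and retry are all the runtime’s administration mentioned above.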
Fine-grained hardware
Architectural communication
In any given multi-core chip, communication between cores is essential. The standard implementation reigning today is the shared bus, where all communications happen through one global shared bus. Its major advantage is its simplicity: only one thing may travel through it at a time, which naturally gives a strong ordering of messages, and communications are broadcast, removing the need for addressing protocols. But this method doesn’t scale with increasing core counts, as traffic volume increases and the bus becomes a bottleneck; with growing chip size, the bus wires extend farther than a single clock cycle can reach, and the energy requirements quickly hit the power wall.
Instead, Networks-on-Chip have been designed to work much like the internet does: a collection of routers, each one connected to a number of others. Messages are sent across the network until they reach their destination. Routing protocols, the coexistence of physical networks with virtual ones, and network buffering are all ongoing fields of research. Furthermore, NoCs are a perfect foundation for in-hardware message passing support and for more efficient and correct cache coherency protocols.
SIMD
Single Instruction Multiple Data is a technique to exploit data-level parallelism rather than concurrency: if we have, for example, an array of bytes and we want to add 1 to each of them, a SIMD instruction applies the $latex +1 &bg=fffef6$ operation to many bytes of the array simultaneously. This is especially useful for applications like graphics processing, where we often want to add, say, a bit of blue to every single pixel in the picture. Support for this kind of execution is extensive in modern hardware, but vectorisation of legacy code is often not possible.
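A sketch of the shape of the computation rather than of any particular instruction set: adding 1 to every byte of a packed array, here in Haskell over a ByteString. Whether the compiled loop really ends up in SIMD instructions depends on the compiler and backend, but this element-wise, dependency-free pattern is exactly what SIMD units are built to accelerate:

```haskell
import qualified Data.ByteString as BS

-- One operation applied uniformly to every element, with no dependency
-- between elements: the data-parallel pattern SIMD hardware exploits.
brighten :: BS.ByteString -> BS.ByteString
brighten = BS.map (+ 1)

main :: IO ()
main = print (BS.unpack (brighten (BS.pack [0, 10, 254, 255])))
  -- [1,11,255,0]   (bytes are Word8, so 255 wraps around to 0)
```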
Cache coherency
The cache coherency protocol is one of the hardest parts of a modern CPU to make both fast and correct. Most of the complexity involved comes from supporting a language in which data is expected to be both shared and mutable as a matter of course. Consider in contrast an Erlang-style abstract machine, where every object is either thread-local or immutable (Erlang has a simplification of even this, where there is only one mutable object per thread). A cache coherency protocol for such a system would have two cases: mutable or shared. A software thread being migrated to a different processor would need its cache explicitly invalidated, but that’s a relatively uncommon operation [5].
Hardware Threading
As we mentioned, the overhead of the OS scheduler combined with a context switch is generally too high to scale across increasing core counts. Proposals have been made in academia, for example: to introduce hardware support for task queues, including priority queues, that cores can enqueue work into; to make context switches lazy (currently supported for floating-point state in most standard Intel chips); and to make cores able to handle arbitrarily high numbers of threads at once, thereby reducing the frequency of context switches.
Further Reads
Most of the ideas in this post are extracted from Chisnall’s and Chadwick’s papers, each of which cites a lot more bibliography of its own. While I’m not an expert in ongoing hardware research, I think it’s important to raise awareness of the problems and promises of the field. After all, you need some hardware to run your programs, and questions like how decoupled the two can be from each other, how any given semantics should model a machine, and how your hardware can make your language better are important to every computer fan.
1. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-832.pdf [§3.1]
2. Stijn Eyerman and Lieven Eeckhout. Modeling critical sections in Amdahl’s law and its implications for multicore design.
3. https://www.schoolofhaskell.com/school/advanced-haskell/beautiful-concurrency/3-software-transactional-memory
4. Cascaval et al. Software Transactional Memory.
5. https://queue.acm.org/detail.cfm?id=3212479