Mircea Baja @ 11 October 2022 # Advanced Introduction to Multithreading --- class: large-points # Scope - Advanced in that it assumes audience used/heard of threads - Introduction in that the subject is deep and complex - Many threading bugs are due to incorrect assumptions about how source code is actually executed - "What not to do" is as important as "What to do" - "We're doing this thing that we shouldn't be doing, is it all right if we keep on doing it?" --- class: large-points # Agenda - Compiler issues - CPU issues - Threading before C++11 - C++11 data races - C++11 standardeese - The volatile myth - C++11 solutions --- class: large-points # Compiler issues -- - **Single threaded** Fibonacci - Exponential version claims to take "0 milliseconds". What? --- # Fibonacci ```cpp #include
#include
int fib(int x) { if (x < 2) return 1; return fib(x - 1) + fib(x - 2); } int main() { auto start = ::GetTickCount(); int result = fib(42); auto stop = ::GetTickCount(); std::cout << result << '\n'; std::cout << "It took: " << (stop - start) << '\n'; return 0; } ``` ```cpp 433494437 It took: 0 ``` ??? > --- # Generated code ```asm 12: auto start = ::GetTickCount(); mov esi,dword ptr [__imp__GetTickCount@0 (09B3000h)] push edi call esi mov edi,eax 14: auto stop = ::GetTickCount(); call esi 13: int result = fib(42); mov ecx,29h 14: auto stop = ::GetTickCount(); mov esi,eax 13: int result = fib(42); call fib (09B1000h) mov ecx,28h mov edx,eax call fib (09B1000h) ``` - Reordered (not for "external" functions) - Interleaved - This is not a compile bug --- # DAG  - Undefined (but maybe deterministic, unfeasible though for any but the most trivial cases) --- # CPU Issues --- class: large-points # Dekker and Peterson's algorithms - Are well known academical algorithms for thread synchronization - They have a common pattern of two flags - One thread sets one flag to indicate intention to access critical section, then check the other flag - Another thread does the same with the flags switched around --- # Assembly code sample - Memory at `x` and `y` starts at 0 - One thread: ```asm mov [x],1 ; store 1 at the location x in memory mov eax,[y] ; load value from location y in memory to register eax ``` - Another thread: ```asm mov [y],1 ; store 1 at the location y in memory mov ebx,[x] ; load value from location x in memory to register ebx ``` - What are the possible outcomes for values in `eax` and `ebx`? -- - They can also end up 0 - If the loads are started before the store to memory completes - A CPU write buffer is all is needed to create this kind of issues --- # TSO for x86  - TSO = Total Store Ordering - Lock free is not synchronization free --- class: large-points # CPU issues - Speculative execution - Data dependency ordering (DEC Alpha) - Weak and strong memory models - writes can be reordered ahead of other reads - writes can be reordered ahead of other writes - reads can be reordered ahead of other reads - reads can be reordered ahead of other writes (x86) - x86 ... ARM/PowerPC ... DEC Alpha - The exact issues depend on processor - "Weak/strong" memory model - it's a simplification --- # Dependencies - Some are easy e.g. read pointer, write data to the value of the pointer - But others need to be made explicit by the developer - E.g. the developer has in mind the following DAG:  - How is the CPU supposed to figure out the dependencies between accesses to different memory locations (a, b and the flag)? --- # Threading before C++11 -- ```cpp DWORD WINAPI ThreadFn(LPVOID) { return 0; } int main() { int result = fib(39); auto m = ::CreateMutexW(NULL,NULL,NULL); result = fib(40); auto h = ::CreateThread(NULL, 0, ThreadFn, &result, 0, NULL); result = fib(41); ::WaitForSingleObject(m, INFINITE); // \ / result = fib(42); // | ::ReleaseMutex(m); // / \ ::WaitForSingleObject(h, INFINITE); std::cout << result << '\n'; return 0; } ``` ??? > --- # DAG  - Can't provide lightweight synchronization primitives without compiler support - Historic struggles for Linux kernel locks, Java VM garbage collectors - When the compiler performs optimisations, threading cannot be implemented as a language library (e.g. like a JSON parser library): it requires compiler support/guarantees --- # C++ data race --- # Without appropriate synchronization, access to the same data from multiple threads, with at least of one of them writing, is undefined. - C++ data race, introduced in C++11 - The language memory model is very weak (think DEC Alpha) allowing scope for optimisations - Also just because you could take advantage of the weak memory model, most of the use cases should not (it's time consuming to ensure correctness) - Just because it seems to work, it does not mean that it really works and it’s not just a bug waiting to happen --- # Yes but myth  --- # Integer data race ```cpp // Given two `int` variables `x` and `y` set initially to `0`: // Thread 1: if (x != 0) { ++y; } // Thread 2: if (y != 0) { ++x; } ``` --- # Integer data race - part 2 ```cpp // Thread 1: ++y; if (x == 0) { --y; } // Thread 2: ++x; if (y == 0) { --x; } ``` --- # Boolean data race ```cpp // given a `bool` variable set initially to `false` // Thread 1: while (!stop) { // do work } // Thread 2: stop = true; ``` --- # Boolean data race - part 2 ```cpp // Thread 1: if (!stop) { while (true) { // do work } } // Thread 2: stop = true; ``` --- class: large-points # Boolean data race - part 3 - "Yes but we're not checking the bool in a loop and a bool is just true or false" - The compiler does all sort of optimisations, including storing the value of a variable in multiple locations: it might have different values at different locations - Getting the right value is only part of the problem, the order of operations also matters --- class: large-points # Boolean data race - part 4 - "Then why is our code working as expected anyway?" - Undefined behaviour might just happen to do what you wanted - If your code is complex and makes a lot of opaque calls (like the OS calls we've seen above) the compiler might fetch the boolean whenever is needed and not cache it. - You might get away with bugs, but it's still a bug waiting to happen when code is refactored, the compiler gets improved etc. --- class: large-points # Memory barriers - A general term for a mechanism that provides guarantees with regards to optimisations that could impact threading - They need to go all the way: compiler + hardware - Sometimes explicit, sometimes implicit - Some are two way, some are one way e.g. - write-release: e.g. ensure that all writes are performed BEFORE this write - read-acquire: e.g. ensure that all reads are performed AFTER this read - It becomes complicated quickly: ordering of barriers themselves is required to achieve e.g. sequential consistency (i.e. as if interleaved) - See [Herb Sutter: atomic Weapons](https://www.youtube.com/watch?v=A8eCGOqgvH4) 2012 --- # What do to then? --- # Mutex locking - `unlock()` synchronizes with any subsequent `lock()` operations that obtain ownership of the same `std::mutex`  - Use in conjunction with a `std::condition_variable` to notify --- # Thread creation - the completion of the invocation of the `std::thread` constructor synchronizes with the beginning of the invocation of thread function  --- # Thread completion - the completion of the `std::thread` synchronizes with the corresponding successful `join()` return  --- # The volatile myth --- # Volatile  - generate reads or writes to memory addresses for assignments to `volatile` qualified variables, instead of optimising them away - preserve the order of assignments to `volatile` qualified variables, ensuring they are not reordered --- # The volatile myth ```cpp int a = 0; int b = 0; bool flag = false; void producer_thread() { // write a and b a = 42; b = 43; // set the flag at the end flag = true; } void consumer_thread() { // wait for the flag to be set while (!flag) continue; // the flag was set: use a and b ... } ``` - "Just" qualify the `flag` as `volatile` --- # Developer hopes  --- # Reality  --- # Another volatile example ```cpp void fn(unsigned int value) { value = 65535U / value; // calculation cli(); // disable interrupts (cli = clear interrupt) global_variable = value; // assignment sei(); // enable interrupts (sei = set interrupt) } ```  --- class: large-points # Volatile in practice - Java and C# - volatile is atomic and adds full memory fences - In C++11 volatile does not guarantee memory barriers for non-volatile access - Microsoft Visual C++: `/volatile:ms` OR `/volatile:iso` - Meaning of volatile changed over time ... and will continue to do so --- class: large-points # Next steps - Avoid data races - Keep it (relatively) simple in order to be able to reason about correctness - If you need absolutely all the info in this thread to reason about multithreading correctness, you've probably over-extended - Different applications have very different threading (and error handling) requirements --- # Questions? --- # Coding links --- # Class lifetime https://bajamircea.github.io/coding/cpp/2015/04/02/class-lifetime.html
allocate
memory
construct
base
Base1
construct
base
Base2
construct
member
a
construct
member
b
init
vtable
pointer
SomeClass
constructor
body
usage
deallocate
memory
destruct
base
Base1
destruct
base
Base2
destruct
member
a
destruct
member
b
revert
vtable
pointer
SomeClass
destructor
body
--- # Spot the bug https://bajamircea.github.io/coding/cpp/2015/04/08/threading-bug.html --- # The thread sandwich https://bajamircea.github.io/coding/cpp/2019/11/12/thread-sandwich.html 