Mircea Baja @ 20 March 2020 # Advanced Introduction to Multithreading - Advanced in that it assumes audience used/heard of threads - Introduction in that the subject is deep and complex - "we're doing this thing that we shouldn't be doing, is it all right if we keep on doing it?" --- # Agenda - Compiler re-ordering - Demo - DAG - CPU Memory model - "The sample" - Speculative execution - CPU data dependency ordering - Weak and strong memory models - TSO model for x86 - Why memory fences - Threading before C++11 - C++11 data races - Multiple writes mythical defence - Integer data race - Boolean data race - C++11 standardeese - synchronises with - Mutex - Thread creation - Thread completion - The volatile myth - Mention the thread sandwich pattern --- class: large-points # Demo - Single threaded Fibonacci - Exponential version takes "0 milliseconds" --- # Fibonacci ```cpp #include
#include
int fib(int x) { if (x < 2) return 1; return fib(x - 1) + fib(x - 2); } int main() { auto start = ::GetTickCount(); int result = fib(42); auto stop = ::GetTickCount(); std::cout << result << '\n'; std::cout << "It took: " << (stop - start) << '\n'; return 0; } ``` ```cpp 433494437 It took: 0 ``` ??? > --- # Generated code ```asm 12: auto start = ::GetTickCount(); mov esi,dword ptr [__imp__GetTickCount@0 (09B3000h)] push edi call esi mov edi,eax 14: auto stop = ::GetTickCount(); call esi 13: int result = fib(42); mov ecx,29h 14: auto stop = ::GetTickCount(); mov esi,eax 13: int result = fib(42); call fib (09B1000h) mov ecx,28h mov edx,eax call fib (09B1000h) ``` - reordered (not for "external" functions) - interleaved --- # DAG ![Image](../assets/2019-10-23-compiler-reordering/graph.png) - undefined (but maybe deterministic, unfeasable though for any but the most trivial cases) --- # The CPU memory model sample - memory at `x` and `y` starts at 0 - one thread: ```asm mov [x],1 ; store 1 at the location x in memory mov eax,[y] ; load value from location y in memory to register eax ``` - another thread: ```asm mov [y],1 ; store 1 at the location y in memory mov ebx,[x] ; load value from location x in memory to register ebx ``` - what are the possible outcomes for values in `eax` and `ebx` -- - they can also end up 0 --- class: large-points # CPU issues - Speculative execution - Data dependency ordering (DEC Alpha) - Weak and strong memory models - writes can be reordered ahead of other reads - writes can be reordered ahead of other writes - reads can be reordered ahead of other reads - reads can be reordered ahead of other writes (x86) - x86 ... ARM/PowerPC ... DEC Alpha - "Weak/strong" memory model - it's a simplification --- # TSO for x86 ![Image](../assets/2019-10-25-cpu-memory-model/tso.png) - Lock free is not synchronization free --- # Memory barrier ![Image](../assets/2019-10-25-cpu-memory-model/mem-fence-motivation.png) - write-release: e.g. ensure that all writes are performed BEFORE the write to the flag. - read-acquire: e.g. ensure that all reads are performed AFTER the read of the flag. - sometimes explicit, sometimes implicit --- # Threading before C++11 ```cpp DWORD WINAPI ThreadFn(LPVOID) { return 0; } int main() { int result = fib(39); auto m = ::CreateMutexW(NULL,NULL,NULL); result = fib(40); auto h = ::CreateThread(NULL, 0, ThreadFn, &result, 0, NULL); result = fib(41); ::WaitForSingleObject(m, INFINITE); result = fib(42); ::ReleaseMutex(m); ::WaitForSingleObject(h, INFINITE); std::cout << result << '\n'; return 0; } ``` ??? > --- # DAG ![Image](../assets/2019-10-29-threading-before-cpp11/graph.png) - can't provide lightweight synchronization primitives without compiler support - historic struggles for Linux kernel locks, Java VM garbage collectors - when the compiler performs optimisations, threading cannot be implemented as a library: it requires compiler support/guarantees --- # Without appropriate synchronization, access to the same data from multiple threads, with at least of one of them writing, is undefined. - C++11 data race - The language memory model is very weak (think DEC Alpha) allowing scope for optimisations - Just because it seems to work, it does not mean that it really works and it’s not just a bug waiting to happen --- # Yes but myth ![Image](../assets/2019-10-31-cpp-data-race/not-atomic.png) --- # Integer data race ```cpp // Given two `int` variables `x` and `y` set initially to `0`: // Thread 1: if (x != 0) { ++y; } // Thread 2: if (y != 0) { ++x; } ``` --- # Integer data race - part 2 ```cpp // Thread 1: ++y; if (x == 0) { --y; } // Thread 2: ++x; if (y == 0) { --x; } ``` --- # Boolean data race ```cpp // given a `bool` variable set initially to `false` // Thread 1: while (!stop) { // do work } // Thread 2: stop = true; ``` --- # Boolean data race - part 2 ```cpp // Thread 1: if (!stop) { while (true) { // do work } } // Thread 2: stop = true; ``` --- # Mutex locking - `unlock()` synchronizes with any subsequent `lock()` operations that obtain ownership of the same `std::mutex` ![Image](../assets/2019-11-02-cpp11-synchronizes-with/mutex.png) --- # Thread creation - the completion of the invocation of the `std::thread` constructor synchronizes with the beginning of the invocation of thread function ![Image](../assets/2019-11-02-cpp11-synchronizes-with/creation.png) --- # Thread completion - the completion of the `std::thread` synchronizes with the corresponding successful `join()` return ![Image](../assets/2019-11-02-cpp11-synchronizes-with/join.png) --- # Volatile ![Image](../assets/2019-11-05-cpp11-volatile/mappedio.png) - generate reads or writes to memory addresses for assignments to `volatile` qualified variables, instead of optimising them away - preserve the order of assignments to `volatile` qualified variables, ensuring they are not reordered --- # The volatile myth ```cpp int a = 0; int b = 0; bool flag = false; void producer_thread() { // write a and b a = 42; b = 43; // set the flag at the end flag = true; } void consumer_thread() { // wait for the flag to be set while (!flag) continue; // the flag was set: use a and b ... } ``` - "Just" qualify the `flag` as `volatile` --- # Developer hopes ![Image](../assets/2019-11-05-cpp11-volatile/mem-fence-motivation.png) --- # Reality ![Image](../assets/2019-11-05-cpp11-volatile/without-mem-fence.png) --- # Another volatile example ```cpp void fn(unsigned int value) { value = 65535U / value; // calculation cli(); // disable interrupts (cli = clear interrupt) global_variable = value; // assignment sei(); // enable interrupts (sei = set interrupt) } ``` ![Image](../assets/2019-11-05-cpp11-volatile/gccavr.png) --- # Volatile in practice - Java and C# - volatile is atomic and adds full memory fences - In C++11 volatile does not guarantee memory barriers for non-volatile access - Microsoft Visual C++: `/volatile:ms` OR `/volatile:iso` - Meaning of volatile changed over time ... and will continue to do so --- # The thread sandwich - keep it (relatively) simple in order to be able to reason about correctness - if you need all the info in this thread to reason about multithreading correctness, you've probably over-extended - different applications have very different threading (and error handling) requirements - the thread sandwich pattern splits threading code into 3 components (control/queue, function body, underlying thread) ![Image](../assets/2019-11-12-thread-sandwich/sandwich.png) https://bajamircea.github.io/coding/cpp/2019/11/12/thread-sandwich.html --- # Questions?