Mircea Baja - 6 May 2025

# Coroutines problem domain
---

# Concurrent vs. parallel

- intuitively illustrated using abstract pictures
- for an intro into C++ standardese doublespeak see ["Forward Progress Guarantees in C++ - Olivier Giroux - CppNow 2023"](https://www.youtube.com/watch?v=g9Rgu6YEuqY&t=978s)

---

# Sequential
# Parallel

- there is additional overhead (not illustrated), uses more hardware resources

---

# Sequential
# Concurrent

- purple and yellow end later
- there is additional overhead (not illustrated)

---

# Sequential
# Concurrent and parallel

- purple ends later
- there is additional overhead (not illustrated), uses more hardware resources

---

# What are coroutines?

---

# Function vs. coroutine
---

# Coroutines

---

# Stack

---

# Stackful

---

# Stackless

---

# Duality

---

# Use cases for concurrency

---

# Compilers (maybe)
source → lexical analysis → syntactical analysis → data analysis → instruction generator → code
- Melvin Conway: Design of a separable transition-diagram compiler
  - of Conway's law fame: product design mirrors organisation structure
- motivation: a single-pass compiler in memory-constrained environments; each "coroutine" can output zero, one, or more results each time it is invoked
- in the meantime: better ways of designing a compiler

---

# Simulations

- Bjarne Stroustrup: [A Set of C++ Classes for Co-routine Style Programming.](https://www.softwarepreservation.org/projects/c_plus_plus/cfront/release_e/doc/ClassesForCoroutines.pdf) Bell Laboratories Computer Science Technical Report CSTR-90. November 1980.
- motivation: event-driven simulations (e.g. Bjarne's Ph.D. thesis work in Cambridge UK)
- `task` base class used to represent an independent activity which:
  - suspends voluntarily
  - can be resumed, canceled
  - can wait, sleep
  - provides an `int` result and communicates with other `task`s (e.g. producer and consumer, server reading from a queue)
  - work is done in the derived class constructor
- one of the earliest libraries in C++ (think `complex` and `string` without templates), yet it did not make it into the standard
- some of the problems it tries to solve turn out to be recurring: cancellation, time management, forking and joining, queues, debugging, stack size, memory overhead etc.

---

# While down memory lane (1)

[A History of C++: 1979-1991, Bjarne Stroustrup (1995)](https://www.stroustrup.com/hopl2.pdf)

"was used to write the library that supported the desired styles of concurrency. Please note that “styles” is plural. I considered it crucial – as I still do – that more than one notion of concurrency should be expressible in the language. This decision has been reconfirmed repeatedly by me and my colleagues, by other C++ users, and by the C++ standards committee. There are many applications for which support for concurrency is essential, but there is no one dominant model for concurrency support; thus when support is needed it should be provided through a library or a special purpose extension so that a particular form of concurrency support does not preclude other forms."

- this is another recurring theme: there are all sorts of ways to do concurrency

---

# While down memory lane (2)

[A History of C++: 1979-1991, Bjarne Stroustrup (1995)](https://www.stroustrup.com/hopl2.pdf)

"C with Classes could not provide benefits at the expense of removing “dangerous” or “ugly” features of C. This observation/principle had to be repeated often to people (rarely C with Classes users) who wanted C with Classes made safer by increasing static type checking along the lines of early Pascal"

- this is another recurring theme: e.g. the drive to use Rust "for safety" is not a grass-roots movement coming from C++ users/developers

---

# GUI

```cpp
BOOL bRet;
while ( (bRet = ::GetMessage(&msg, hWnd, 0, 0)) != 0)
{
  if (bRet == -1)
  {
    // handle the error and possibly exit
  }
  else
  {
    ::TranslateMessage(&msg);
    ::DispatchMessage(&msg);
  }
}
```

---

# GUI
(diagram: a UI thread blocked in `GetMessage` on its queue; messages are added via `PostMessage` and `PostThreadMessage`)
---

# GUI

- usually single threaded: a single dedicated UI thread
- the thread has a queue
- the thread is blocked waiting for a message from the queue
- when a message is returned, the thread processes it via `switch` statements
- messages are added to the queue via `PostMessage` (from the same thread) or `PostThreadMessage` (from a different thread)
- there is also a special "quit" message
- GUIs show the use of a queue of work to manage concurrency and of a blocking function to get the next piece of work; they do not necessarily need coroutines (long background activities are often scheduled on additional threads)

---

# Async IO

- motivation: C10K (1999)
- networking is often IO bound (waiting for data to be sent or received)
- what's wrong with `WaitForMultipleObjects` (more later): `O(N)` time complexity with the number of handles provided
  - similar: `select` for Linux
- IO completion ports: it's a queue again (more later)
  - similar: `kevent` for Linux
- (boost) ASIO
- file IO
- timers/events/registry monitoring
- one problem: thread(s) are blocked in an API call to get data from one queue; a thread can't check for work in multiple incompatible queue types
- Windows thread pools unify work covered by IO completion ports (e.g. networking) with work not covered via IO completion ports
- similar evolution on other OSes

---

# WaitForMultipleObjects

- when a thread calls `WaitForMultipleObjects`, the caller provides a set of objects it waits for
  - for each object in the set, the OS has to atomically (thread safe) add this thread ID to a list associated with the object
  - the thread is suspended
- when an object in the set is signaled, the OS finds the thread ID in the object's list
  - the thread is scheduled to resume
- when the thread resumes
  - for each object in the set, the OS has to atomically remove this thread ID from the list associated with the object
  - `WaitForMultipleObjects` returns
- the steps above lead to `O(N)` time complexity with the number of objects provided, which is why `WaitForMultipleObjects` limits it to 64

---

# Completion port
(diagram: two threads blocked in `GetQueuedCompletionStatus` on the same completion port)
---

# Completion port

- when a thread calls `GetQueuedCompletionStatus`
  - it checks if work is queued (requires a lock)
  - if not, it suspends; only a single object needs to have this thread ID stored
- when an object associated with the completion port is signaled
  - it's added to the queue
  - the thread is scheduled to resume (if needed)
- when the thread resumes
  - it picks work from the queue
  - `GetQueuedCompletionStatus` returns
- it solves the `O(N)` problem: the work to queue/dequeue one item does not depend on the number of items in the queue
- the disadvantage is that most such queueing systems are incompatible: a thread is blocked reading from a particular queue system

---

# How about C++11 mechanisms?

- `std::thread`: too many resources (e.g. stack) just to serve a single operation or a few operations
- `std::async`: similar, also poorly defined behaviour
- `std::future`/`std::promise`: require allocation and reference counting of shared state, synchronization even for sequential activities AND detached operation, i.e. the operation continues even if the `std::future` does not wait for the result (potential for dangling references)

---

# Coroutines: Async IO

- C++20 coroutines allow code like the one below
- network echo work with coroutine support
- motivation for this series of articles/presentations

```cpp
task echo(socket s) {
  buffer buff;
  while (true) {
    std::size_t length = co_await async_read_some(s, buff);
    co_await async_write(s, buff, length);
  }
}
```

- easy to understand behaviour e.g.:
  - scope for the buffer
  - exceptions (but slow, so use sparingly)
  - cancellation
- disadvantage: additional allocation

---

# Coroutines: Async IO

- without coroutines it's a lot of callbacks and manual memory management

```cpp
class session : public std::enable_shared_from_this<session>
{
  // ...
  void do_read() {
    auto self(shared_from_this());
    socket_.async_read_some(buff_/*...*/,
      [this, self](boost::system::error_code ec, std::size_t length) {
        do_write(length);
      });
  }
  void do_write(std::size_t length) {
    auto self(shared_from_this());
    boost::asio::async_write(socket_, buff_/*...*/,
      [this, self](boost::system::error_code ec, std::size_t /*length*/) {
        do_read();
      });
  }
```

---

# Nanocoroutines

- motivation: handle memory latencies for database JOINs
- CppCon 2018: G. Nishanov ["Nano-coroutines to the Rescue! (Using Coroutines TS, of Course)"](https://www.youtube.com/watch?v=j9tlJAqMV7U)
- the "slow" operation is fetching data from memory to the CPU cache
- the allocation overheads of coroutines will have to be mitigated somehow

---

# Generator

```cpp
struct fib_state {
  int x_1 = 1;
  int x_2 = 0;
};

int next_fib(fib_state & state) {
  int val = state.x_1 + state.x_2;
  state.x_2 = state.x_1;
  state.x_1 = val;
  return val;
}

void foo() {
  fib_state state;
  while (true) {
    int x = next_fib(state);
    // use x, exit loop at some point
  }
}
```

---

# Generator

```cpp
generator fib() {
  int x_1 = 1;
  int x_2 = 0;
  while (true) {
    int val = x_1 + x_2;
    yield val;
    x_2 = x_1;
    x_1 = val;
  }
}

void foo() {
  for (int x: fib()) {
    // use x, exit loop at some point
  }
}
```

- `val` is `yield`ed before `x_2` and `x_1` are updated

---

# Generator

- using C++23

```cpp
std::generator<int> fib() {
  int x_1 = 1;
  int x_2 = 0;
  while (true) {
    int val = x_1 + x_2;
    co_yield val;
    x_2 = x_1;
    x_1 = val;
  }
}

void foo() {
  for (int x: fib()) {
    // use x, exit loop at some point
  }
}
```

---

# GPU work

- there is a lot of effort (e.g. by Nvidia) to make computational work written in C++ (algorithms, coroutines, sender/receiver etc.) compatible with "normal" CPU usage
- beware that choices made for one use case (e.g. GPU) might not be the right ones for another use case (e.g. async networking IO)

---

# Embedded work

- don't throw exceptions
- don't allocate

---

# Kernel mode

- don't throw exceptions
- allocations: don't throw `std::bad_alloc` either

---

# Sender/receiver

- concepts for concurrency included in C++26
- can work with coroutines, CPU, GPU, embedded
- can deliver fast code, no allocations
- but with development complexity costs for simple sequential code

---

# Choices

---

# Interaction with threads

- a single thread simplifies a lot of things, but limits how much computation is available
- explicit multithreading
- can a coroutine resume in a different thread?
  - `scoped_lock`/`lock_guard`: potential for misuse
- detached operations - work is not waited to complete: usually a bad pattern
  - but in embedded work it makes sense to continue work, triggered by interrupts, long after `main` exited

---

# Exception-less mode

- e.g. for use in kernel code without C++ exceptions
- this was important for Microsoft (which significantly drove C++ coroutine adoption)

---

# Memory

- how do we control allocations?
- even better: how do we avoid allocations?
- in some cases it's important to avoid allocations:
  - hot/performance sensitive code
  - kernel/embedded environments
- also stack usage: it's easy to stack overflow

---

# Timing matters/choices

- read operations might get more than what was requested (e.g. when the request is smaller than the data in the IP packet). Subsequent reads:
  - could always queue the continuation (to allow fairness, at the cost of additional work)
  - could avoid queueing if data is already available (maximise speed by avoiding work in this case)
- waiting with `when_all(op1, op2, op3)` can mean different things in different environments:
  - initiate all operations, progress is interleaved, return when the last one (by time) completes (e.g. op2)
  - initiate sequentially, progress is not interleaved, return when the last one (in sequence) completes (i.e. op3)

---

# Cancellation

- it's easy to cancel before starting work
- but in some cases the work can be pending for long periods of time: e.g. waiting for a registry entry to change (i.e. maybe never) => a cancellation mechanism IS required

---

# Programming ergonomics

- a library that can do anything might sacrifice ease of use
  - and might not do specific things well
  - boost::asio and CreateFileA
  - networking TS and HTTPS
- how easy is it to shoot yourself in the foot?
  - sometimes safety is sacrificed to allow some usage scenarios

---

# Questions?