Premise:

  1. This article assumes a basic understanding of coroutines.
  2. When I wrote this, I knew very little about coroutines; it was written mostly for the record, and based on past experience I suspect I won't be able to bear rereading it for a while.
  3. My level is limited, so if you find anything improperly expressed in the article, corrections are welcome.

Coroutines definition

The proposal of coroutines can be traced back to Melvin E. Conway’s 1963 paper “Design of A Transition-Diagram Compiler”.

In this paper, he proposed organizing a compiler as a set of coroutines, which makes it possible to run its passes separately when debugging and then run it as a single-pass compiler in production. The relevant passage reads:

coroutines are subroutines all at the same level, each acting as if it were the master program when in fact there is no master program.

This is perhaps the earliest description of coroutines, but unfortunately the concept still lacks a clear and universally agreed definition. As a result, different languages and systems ship different implementations based on different understandings and expectations. Here we define coroutines using the two characteristics Marlin summarized in his thesis:

  • A coroutine's local data persists across successive calls.
  • When control is transferred away, execution of the coroutine is suspended, and it resumes from where it left off only when control later re-enters it.

Of course, this characterization says nothing about a coroutine's underlying structure, which leaves room for classifying coroutines from several perspectives:

  1. By the control-transfer mechanism between coroutines: symmetric versus asymmetric coroutines.
  2. By whether each coroutine keeps its own runtime stack: stackful versus stackless coroutines.
  3. By whether coroutines are provided as first-class objects or as constrained language constructs.

In this article, we will focus on the difference between stackful and stackless coroutines.

Why Coroutine?

We use the producer-consumer model to illustrate the use of coroutines. Note that system-level coroutines often go by another name, fibers, to distinguish them from language-level implementations.

The traditional producer-consumer model assumes a process with two threads, one consumer and one producer, sharing a common address space. We can implement the model this way easily enough, but it is tedious and costly: you are constantly context-switching between producer and consumer, and that switching is scheduled by the OS. With many programs running on the machine, the scheduler knows nothing about the relationship between producer and consumer; the producer may have finished generating data long ago while the consumer still cannot get scheduled.

One way around these problems is to put both roles into a single thread and write a pile of interdependent state-checking code, but that is a step backwards:

  1. We adopted the multi-threaded model precisely to decouple producer and consumer and make them asynchronous, so that each can work as independently as possible without worrying about the other.
  2. The single-thread approach is unnatural: producer and consumer should cooperate as peers rather than be forced into fixed caller and callee roles.

Coroutines are conceptually similar to threads; in fact they are sometimes called user-mode threads. But instead of being preemptively scheduled like threads, coroutines yield control voluntarily, and switching between them is cheaper because there is far less context to save and no user-mode-to-kernel-mode transition.

import asyncio
import random

async def produce(queue, n):
    for item in range(n):
        # produce an item, using sleep to simulate an I/O operation
        print('producing item {} ->'.format(item))
        await asyncio.sleep(random.random())
        # put the item on the queue
        await queue.put(item)
    # signal that production is complete
    await queue.put(None)

async def consume(queue):
    while True:
        # wait for an item from the producer
        item = await queue.get()
        if item is None:
            break
        # consume the item, using sleep to simulate an I/O operation
        print('consuming item {} <-'.format(item))
        await asyncio.sleep(random.random())

async def main():
    queue = asyncio.Queue()
    task1 = asyncio.create_task(produce(queue, 10))
    task2 = asyncio.create_task(consume(queue))
    await task1
    await task2

asyncio.run(main())

Stackful and stackless coroutines

From our definition of coroutines above, you can see that the key to implementing coroutines is how data flow and control flow are maintained across jumps. Setting aside how state is preserved for a moment, how do you implement the control-flow transfer itself? There are two options:

  1. Call a new function

  2. Return from the current function

stackful

The first scheme is the idea behind stackful coroutines, and it is a very natural one. Each coroutine gets its own stack and control block. Before a coroutine is suspended, the non-volatile registers of the currently active coroutine are saved into its control block; the newly activated coroutine restores its registers from its own control block before resuming, so a coroutine context switch amounts to swapping stack spaces. In fact, you can find the getcontext, setcontext, swapcontext, and makecontext declarations in glibc's ucontext.h, and some open-source coroutine libraries are built on them.
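
As a minimal sketch of this register-and-stack swapping, assuming a POSIX system with `<ucontext.h>` (the names `run_demo`, `coroutine_body`, and the trace string are mine, for illustration only):

```cpp
#include <ucontext.h>
#include <string>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];   // independently allocated stack for the coroutine
static std::string trace;          // records the interleaving for demonstration

static void coroutine_body() {
    trace += "step1 ";
    // Suspend: save our registers into co_ctx, resume the main context.
    swapcontext(&co_ctx, &main_ctx);
    trace += "step2";
    // Falling off the end transfers control to uc_link (main_ctx).
}

std::string run_demo() {
    trace.clear();
    getcontext(&co_ctx);                         // initialize with the current context
    co_ctx.uc_stack.ss_sp   = co_stack;          // give the coroutine its own stack
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;                  // where control goes when the body returns
    makecontext(&co_ctx, coroutine_body, 0);

    swapcontext(&main_ctx, &co_ctx);  // run the coroutine until it yields
    trace += "main ";
    swapcontext(&main_ctx, &co_ctx);  // resume it from where it left off
    return trace;                     // "step1 main step2"
}
```

Each `swapcontext` call is exactly the save/restore of the control block described above; the coroutine resumes in the middle of `coroutine_body` with all its locals intact because they live on its private stack.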

This raises a question: how is the stack space allocated?

One option is an independent stack per coroutine: each coroutine is allocated a fixed-size stack. But how much? Too little risks stack overflow; too much wastes memory; and leaving the choice to the programmer adds mental burden. To guard against overflow, we usually give each task an upper bound on stack space and adjust the stack pointer directly on task switch; only one coroutine runs on a thread at any moment. You could of course allocate on the heap instead, but then the call overhead increases noticeably.

The other option is a shared stack: we pre-allocate one large block of stack space shared by all coroutines, and on suspend or transfer we copy out just the portion the current coroutine actually used, i.e. the backup is sized dynamically. Compared with independent stacks, the drawback of a shared stack is that every task switch requires this more expensive copy.
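
To make the copying concrete, here is a rough sketch of shared-stack bookkeeping; the names `Coro`, `save_stack`, and `restore_stack` are hypothetical and not from any particular library:

```cpp
#include <cstring>
#include <cstdlib>

// Hypothetical shared-stack bookkeeping sketch, not a real library API.
struct Coro {
    char*  saved = nullptr;   // heap buffer holding this coroutine's stack bytes
    size_t used  = 0;         // how much of the shared stack it actually used
};

static char shared_stack[1024 * 1024];  // one big stack shared by all coroutines

// On suspend: copy out only the region the coroutine actually used
// (stacks grow downward, so the live region is [stack_top_in_use, end)).
void save_stack(Coro& co, char* stack_top_in_use) {
    co.used  = shared_stack + sizeof(shared_stack) - stack_top_in_use;
    co.saved = static_cast<char*>(std::realloc(co.saved, co.used));
    std::memcpy(co.saved, stack_top_in_use, co.used);
}

// On resume: copy the saved bytes back into place on the shared stack.
void restore_stack(const Coro& co) {
    std::memcpy(shared_stack + sizeof(shared_stack) - co.used, co.saved, co.used);
}
```

A real implementation (libco-style shared stacks, for example) also swaps register contexts; this sketch only shows the memcpy in and out that makes shared-stack switches costlier than independent-stack switches.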

See also: an annotated version of Yun Feng's coroutine library: github.com/chenyahui/A…

stackless

Stackless coroutines take the second approach. You can think of them as implemented on an object model, where the coroutine's context lives in the object's member variables; equivalently, you can read it as a state machine, like this:

struct coroutine {
    int i;
    int value = 0;
    // Each call to next() resumes from the state recorded in `value`.
    void next() {
        switch (value) {
        case 0:
            return first();
        case 1:
            return second();
        }
    }
    void first() {
        i = 0;
        value = 1;
    }
    void second() {
        i++;
        value = 2;
    }
};
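
Driving such a state machine is just calling an ordinary method repeatedly. A self-contained restatement (the two segments inlined as switch cases, plus a hypothetical `run_state_machine` driver of my own) might look like:

```cpp
// Self-contained restatement of the state-machine coroutine sketch above.
struct coroutine {
    int i = 0;
    int value = 0;  // which resume point comes next
    void next() {
        switch (value) {
        case 0: i = 0; value = 1; return;  // "first" segment
        case 1: i++;   value = 2; return;  // "second" segment
        }
    }
    bool done() const { return value == 2; }
};

int run_state_machine() {
    coroutine c;
    while (!c.done())
        c.next();   // each call resumes from the recorded state; no stack switch
    return c.i;     // 1 after both segments have run
}
```

This is roughly the shape a compiler lowers a stackless coroutine into: the "stack frame" is the object, and suspension is simply returning from `next()`.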

Since there is no stack-frame switch, a stackless coroutine outperforms a stackful one and has a much smaller memory footprint. Moreover, it uses ordinary function call and return, so there is no manual stack-pointer manipulation and no disruption of the return stack buffer's branch prediction. The cost is that the implementation needs compiler support, so overall portability is not as good as with stackful coroutines.

A common misunderstanding

The distinction between stackful and stackless coroutines is easily misunderstood.

Here, stackful versus stackless refers to whether a coroutine is implemented with its own stack structure, in other words, whether a suspended coroutine's state persists in a stack. Many articles and papers argue that the difference is whether a coroutine can suspend inside nested calls, which is not accurate.

How do you understand that?

Imagine we call a coroutine C. When C calls an ordinary routine f, where does f's call frame go? If something inside f() causes the coroutine to suspend, then f()'s call frame must be saved along with C()'s frame.

This can be done in stackful designs, for example with a cactus-stack scheme, but it still introduces considerable complexity. Many languages with stackful coroutines therefore restrict this by requiring every coroutine to be declared at the outermost level of lexical nesting, so that the coroutine stacks are completely disjoint; Modula-2's stackful coroutines work this way.

A stackless coroutine, by contrast, has no per-coroutine stack in which the frames of nested calls could be preserved across a suspension, so it can only suspend from within the coroutine body itself.

Generators and async/await

Coming back to JS, many people say that generators and async/await are stackless coroutines, which is not quite accurate.

This is reflected in the Python PEP that introduced async/await: www.python.org/dev/peps/pe…

async/await was introduced to provide a simple mental model, easy to use and as close as possible to synchronous programming, to better support concurrent programming, and to be clearly distinguished from generators, eliminating the ambiguity between generators and coroutines.

A generator is just that, a generator, or a semi-coroutine if you like (since you can implement coroutines on top of generators); we use generators more for the ability to pause and resume a function than as coroutines proper.

Whether async/await is sugar over generators is entirely up to the compiler implementation.

We can see how V8 handles generator functions in JS with d8 --print-bytecode [path].js.

Take this code for example:

function* testGenerator (){
  yield 1;
  yield 'a string'
}

function main(){
   let gen = testGenerator()
   gen.next()
   gen.next()
}

main()

The generated bytecode is long, so only part of it is shown here (the bytecode for testGenerator).

The interpreter creates a JSGeneratorObject instance via the InvokeIntrinsic [_CreateJSGeneratorObject] bytecode; the JSGeneratorObject class is defined in V8's source.

The two commands to focus on are SuspendGenerator and ResumeGenerator.

IGNITION_HANDLER(SuspendGenerator, InterpreterAssembler) {
  Node* generator = LoadRegisterAtOperandIndex(0);
  TNode<FixedArray> array = CAST(LoadObjectField(
      generator, JSGeneratorObject::kParametersAndRegistersOffset));
  Node* closure = LoadRegister(Register::function_closure());
  Node* context = GetContext();
  RegListNodePair registers = GetRegisterListAtOperandIndex(1);
  Node* suspend_id = BytecodeOperandUImmSmi(3);

  Node* shared =
      LoadObjectField(closure, JSFunction::kSharedFunctionInfoOffset);
  TNode<Int32T> formal_parameter_count = UncheckedCast<Int32T>(
      LoadObjectField(shared, SharedFunctionInfo::kFormalParameterCountOffset,
                      MachineType::Uint16()));

  ExportParametersAndRegisterFile(array, registers, formal_parameter_count);
  StoreObjectField(generator, JSGeneratorObject::kContextOffset, context);
  StoreObjectField(generator, JSGeneratorObject::kContinuationOffset,
                   suspend_id);

  Node* offset = SmiTag(BytecodeOffset());
  StoreObjectField(generator, JSGeneratorObject::kInputOrDebugPosOffset,
                   offset);

  UpdateInterruptBudgetOnReturn();
  Return(GetAccumulator());
}

In simple terms, the SuspendGenerator handler calls LoadRegister, GetContext, LoadObjectField, StoreObjectField, and so on, to save the current state and record the offset of the current bytecode, and then directly returns the value in the accumulator (V8 executes bytecode on a simulated register-based machine). At that point the current stack frame has exited the generator function, so how is the suspension actually achieved?

Very simply: you'll notice that the last line of almost every bytecode handler calls the Dispatch function, which fetches the function's next bytecode and executes it. SuspendGenerator does not call Dispatch, and the pause position is recorded via the saved offset.

IGNITION_HANDLER(ResumeGenerator, InterpreterAssembler) {
  Node* generator = LoadRegisterAtOperandIndex(0);
  Node* closure = LoadRegister(Register::function_closure());
  RegListNodePair registers = GetRegisterListAtOperandIndex(1);

  Node* shared =
      LoadObjectField(closure, JSFunction::kSharedFunctionInfoOffset);
  TNode<Int32T> formal_parameter_count = UncheckedCast<Int32T>(
      LoadObjectField(shared, SharedFunctionInfo::kFormalParameterCountOffset,
                      MachineType::Uint16()));

  ImportRegisterFile(
      CAST(LoadObjectField(generator,
                           JSGeneratorObject::kParametersAndRegistersOffset)),
      registers, formal_parameter_count);

  SetAccumulator(
      LoadObjectField(generator, JSGeneratorObject::kInputOrDebugPosOffset));

  Dispatch();
}

ResumeGenerator is essentially the reverse process: it restores the previously saved state and calls Dispatch to fetch the next bytecode and continue executing the generator.

In V8, async/await is implemented on top of generators; await also generates the SuspendGenerator and ResumeGenerator bytecodes.

The JSAsyncGeneratorObject class inherits the JSGeneratorObject class.

For the code below, the compiler creates a Promise, and the continuation after await is queued as a microtask. The execution state is saved and the function suspends; when the microtask queue is processed, the state is restored and execution resumes. In the browser, this scheduling is in turn subject to the event loop.

async function testAsync(){
    const res = await 0
    return res
}

testAsync()

References

  • Melvin E. Conway. "Design of a Separable Transition-Diagram Compiler". Commun. ACM, Volume 6, Issue 7, July 1963.
  • Ana Lucia De Moura, Roberto Ierusalimschy. "Revisiting Coroutines". ACM TOPLAS, February 2009, Article No. 6.
  • A Way to Practice Programming Languages, 3rd Edition
  • Principles and Implementation of Modern Operating Systems
  • www.boost.org/doc/libs/1_…
  • "Is the async/await asynchronous model superior to the stackful coroutine model?" – answer on Zhihu: www.zhihu.com/question/65…
  • zh.wikipedia.org/wiki/%E5%8D…
  • mthli.xyz/stackful-st…
  • cloud.tencent.com/developer/a…