This series of articles is a translation of, and reading notes on, Node.js Design Patterns, Second Edition. It is kept up to date on GitHub, where you can also find a link to the full translation.

Please follow my columns; the following posts will also be published there:

  • Encounter's column on Juejin (Nuggets)
  • Encounter's "Programming Thoughts" column on Zhihu
  • Encounter's front-end column on SegmentFault

Asynchronous Control Flow Patterns with Callbacks

Moving to a platform such as Node.js, where continuation-passing style (CPS) and asynchronous APIs are the norm, can be difficult for developers used to a synchronous programming style. Writing asynchronous code is a different experience, especially where control flow is concerned: asynchronous code can make it hard to predict the order in which statements are executed. Simple problems such as reading a set of files, performing a list of tasks, or waiting for a set of operations to complete require new approaches and techniques if we want to avoid writing inefficient and unmaintainable code. One common mistake is falling into callback hell, where the code balloons and becomes so deeply nested that even simple programs are difficult to read and maintain. In this chapter, we'll see how to tame callbacks and write clean, manageable asynchronous code by following some rules and patterns. We'll also see how control-flow libraries, such as async, can greatly simplify these problems, making our code more readable and easier to maintain.

The difficulties of asynchronous programming

Getting asynchronous code out of control is all too easy in JavaScript. Closures and in-place definitions of anonymous functions offer a smooth programming experience that doesn't require the developer to manually manage and track the state of asynchronous operations. This is in keeping with the KISS principle: it's simple, it keeps the control flow of asynchronous code compact, and it gets things working in less time. Unfortunately, sacrificing qualities such as modularity, reusability, and maintainability sooner or later leads to uncontrolled growth of callback nesting, bloated function definitions, and poor code organization. Most of the time, creating closures is not functionally needed, so it's more a matter of discipline than a problem intrinsic to asynchronous programming. Recognizing when callback nesting is making our code unwieldy, and then acting on it with the most appropriate solution, is what distinguishes a novice from an expert.

Create a simple Web crawler

To explain the problem, we'll create a little web crawler: a command-line application that takes a URL as input and downloads its contents into a file. In the code that follows, we'll rely on two npm packages, request and mkdirp.

In addition, we'll reference a local module called ./utilities.
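
The contents of ./utilities are not shown in this chapter (the real module ships with the book's code samples), but a simplified sketch of the urlToFilename() helper used below might look like this; it is only an illustration, not the book's actual implementation:

const path = require('path');
const urlParse = require('url').parse;

// A simplified sketch: turn a URL into a filesystem-friendly path such as
// "www.example.com/index.html".
module.exports.urlToFilename = function urlToFilename(url) {
  const parsedUrl = urlParse(url);
  const urlPath = parsedUrl.path.split('/')
    .filter(component => component !== '')
    .map(component => component.replace(/[^a-zA-Z0-9.\-]/g, '_'))
    .join('/');
  let filename = path.join(parsedUrl.hostname, urlPath);
  if (!path.extname(filename).match(/htm/)) {
    filename += '.html';
  }
  return filename;
};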

The core functionality of our application is contained in a module called spider.js. First load the dependency packages we need, as shown below:

const request = require('request');
const fs = require('fs');
const mkdirp = require('mkdirp');
const path = require('path');
const utilities = require('./utilities');

Next, we’ll create a new function called spider() that takes a URL as an argument and calls a callback function when the download process is complete.

function spider(url, callback) {
  const filename = utilities.urlToFilename(url);
  fs.exists(filename, exists => {
    if (!exists) {
      console.log(`Downloading ${url}`);
      request(url, (err, response, body) => {
        if (err) {
          callback(err);
        } else {
          mkdirp(path.dirname(filename), err => {
            if (err) {
              callback(err);
            } else {
              fs.writeFile(filename, body, err => {
                if (err) {
                  callback(err);
                } else {
                  callback(null, filename, true);
                }
              });
            }
          });
        }
      });
    } else {
      callback(null, filename, false);
    }
  });
}

The above functions perform the following tasks:

  • Check whether the URL has already been downloaded by verifying that the corresponding file has not already been created:

fs.exists(filename, exists => ...

  • If the file has not already been downloaded, execute the following code to download it:

request(url, (err, response, body) => ...

  • Then, we make sure that the directory that will contain the file exists:

mkdirp(path.dirname(filename), err => ...

  • Finally, we write the body of the HTTP response to the file system:

fs.writeFile(filename, body, err => ...

To complete our web crawler application, we just need to provide a URL as input (in our case, we read it from the command-line arguments) and invoke the spider() function:

spider(process.argv[2], (err, filename, downloaded) => {
  if (err) {
    console.log(err);
  } else if (downloaded) {
    console.log(`Completed the download of "${filename}"`);
  } else {
    console.log(`"${filename}" was already downloaded`);
  }
});

Now, let's try to run the web crawler. First, make sure that the utilities.js module and the package.json containing the full list of dependencies are present in your project, then install all the dependencies:

npm install

After that, we execute our crawler module to download a web page using the following command:

node spider http://www.example.com

Our web crawler requires that the URL we provide always includes the protocol (for example, http://). Also, don't expect HTML links to be rewritten or resources such as images to be downloaded, as this is just a simple example to demonstrate how asynchronous programming works.

Callback hell

Looking at our spider() function, we can see that even though the algorithm we implemented is really straightforward, the resulting code has several levels of indentation and is hard to read. Implementing a similar function with a direct-style blocking API would be straightforward, and there would be little chance of it looking so wrong. However, using asynchronous CPS is another story, and making careless use of closures can lead to incredibly hard-to-read code.

The situation where an abundance of closures and in-place callback definitions turns the code into an unreadable, unmanageable blob is known as callback hell. It is one of the most widely recognized and severe anti-patterns in Node.js, and in JavaScript in general. The typical structure of code affected by this problem looks as follows:

asyncFoo(err => {
  asyncBar(err => {
    asyncFooBar(err => {
      //...
    });
  });
});

We can see how code written in this way forms a pyramid shape that is hard to read because of the deep nesting; this is why it is also known as the pyramid of doom.

The most obvious problem with code like the previous snippet is poor readability. Because the nesting is so deep, it is almost impossible to keep track of where the callback ends and where another callback starts.

Another problem is caused by the overlapping of the variable names used in each scope. Often, we have to use similar or even identical names to describe the content of a variable. The best example is the error argument received by each callback. Some people try to use variations of the same name to distinguish the object in each scope, for example, error, err, err1, err2, and so on. Others prefer to always use the same name, such as err, shadowing the variable defined in the outer scope. Neither option is ideal; both cause confusion and increase the chance of introducing bugs.

In addition, we have to keep in mind that closures come at a small price in terms of performance and memory consumption. They can also create memory leaks that are not easy to identify: we should not forget that any context variable referenced by an active closure will not be released by the garbage collector.

See Vyacheslav Egorov’s blog post on how closures for V8 work.

If we look at our spider() function, we will clearly notice that it is a classic callback hell scenario, and has all the problems we just described in this function. This is exactly what the patterns and techniques we’ll study in this chapter address.

Using plain JavaScript

Now that we have met our first example of callback hell, we know what we should definitely avoid; however, that's not the only concern when writing asynchronous code. In fact, there are several situations where controlling the flow of a set of asynchronous tasks requires specific patterns and techniques, especially if we only use plain JavaScript without the help of any external library. For example, iterating over a collection by applying an asynchronous operation in sequence is not as easy as invoking forEach() on an array; it actually requires a technique similar to recursion.

In this section, we’ll learn how to avoid callback hell and how to implement some of the most common control flow patterns using simple JavaScript.

Criteria for callback functions

The first rule to keep in mind when writing asynchronous code is to not abuse closures when defining callbacks. Doing so can be tempting, because it does not require any extra thinking about problems such as modularity and reusability; however, we have seen how this can do more harm than good. Most of the time, fixing the callback hell problem does not require any library, fancy technique, or paradigm shift; it just needs some common sense.

Here are some basic principles that can help us nest less and improve the organization of our code:

  • Exit as early as possible. Depending on the context, use return, continue, or break to immediately exit the current code block instead of writing (and nesting) a complete if...else statement. This will help keep our code shallow.
  • Create named functions for callbacks, keeping them out of closures and passing intermediate results as arguments. Named functions will also make stack traces easier to read.
  • Modularize the code whenever possible, splitting it into smaller, reusable functions.

Applying the callback criteria

To demonstrate the above principles, let’s refactor a Web crawler application.

As a first step, we can refactor our error-checking pattern by removing the else statement and returning from the function immediately after an error is received. So, instead of code like this:

if (err) {
  callback(err);
} else {
  // If there are no errors, execute the code block
}

We can improve our code structure by writing:

if (err) {
  return callback(err);
}
// If there are no errors, execute the code block

With this simple technique, we immediately reduce the nesting level of functions, which is simple and does not require any complex refactoring.

A common mistake when applying the optimization we just described is forgetting to terminate the function (with return) after the callback is invoked. For error-handling scenarios, the following code is a typical source of bugs:

if (err) {
  callback(err);
}
// If there are no errors, execute the code block

In this case, the execution of the function continues even after the callback is called. A return statement is necessary to avoid this situation. Also note that it doesn’t matter what output the function returns; the actual result (or error) is generated asynchronously and passed to the callback. The return value of an asynchronous function is usually ignored. This property allows us to write code like this:

return callback(...);

Otherwise we have to break it into two statements:

callback(...);
return;

As we continue refactoring our spider() function, we can try to identify reusable code fragments. For example, the ability to write a given string to a file can easily be broken down into a single function:

function saveFile(filename, contents, callback) {
  mkdirp(path.dirname(filename), err => {
    if (err) {
      return callback(err);
    }
    fs.writeFile(filename, contents, callback);
  });
}

Following the same principle, we can create a generic function called download() that takes a URL and a filename as input and downloads the contents of the URL into the given file. Internally, we can use the saveFile() function we created earlier.

function download(url, filename, callback) {
  console.log(`Downloading ${url}`);
  request(url, (err, response, body) => {
    if (err) {
      return callback(err);
    }
    saveFile(filename, body, err => {
      if (err) {
        return callback(err);
      }
      console.log(`Downloaded and saved: ${url}`);
      callback(null, body);
    });
  });
}

Finally, modify our spider() function:

function spider(url, callback) {
  const filename = utilities.urlToFilename(url);
  fs.exists(filename, exists => {
    if (exists) {
      return callback(null, filename, false);
    }
    download(url, filename, err => {
      if (err) {
        return callback(err);
      }
      callback(null, filename, true);
    });
  });
}

The functionality and interface of the spider() function remain exactly the same; only the way the code is organized has changed. By applying these basic principles, we were able to drastically reduce the nesting of our code while increasing its reusability and testability. In fact, we could consider exporting saveFile() and download() so that we can reuse them in other modules. This also makes it easier to test their functionality.
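
For instance, exporting the two helpers could look like the following sketch; the consuming module and URL shown are hypothetical examples, not part of the chapter's listings:

// At the bottom of spider.js — a sketch, not part of the chapter's code:
module.exports.saveFile = saveFile;
module.exports.download = download;

// A hypothetical consumer could then reuse them:
// const { download } = require('./spider');
// download('http://www.example.com', 'example.html', (err, body) => { /* ... */ });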

The refactoring we’ve done in this section makes it clear that most of the time, all we need is some rules and to make sure we don’t abuse closures and anonymous functions. It works very well, requires minimal work, and uses only raw JavaScript.

Sequential execution

We begin our exploration of asynchronous control flow patterns by analyzing the sequential execution flow.

Performing a set of tasks in sequence means running them one after another at a time. The order of execution is important and must be correct because the results of one task in the list can affect the execution of the next. The following diagram illustrates this concept:

The above asynchronous control flow has a few different variations:

  • Perform a set of known tasks in sequence without linking or passing results
  • Use the output of a task as the input for the next (also known as chain, pipeline, or waterfall)
  • Iterate over a collection while running an asynchronous task on each element, one after the other

For sequential execution, although it is simple to implement using the direct style blocking API, using asynchronous CPS in general causes callback hell problems.

Perform a set of known tasks in sequence

We already faced the problem of sequential execution when implementing the spider() function in the previous section, and by applying the simple rules we explored, we were able to keep a set of known tasks organized in sequence. Taking that code as a guideline, we can generalize the solution with the following pattern:

function task1(callback) {
  asyncOperation(() => {
    task2(callback);
  });
}

function task2(callback) {
  asyncOperation(result => {
    task3(callback);
  });
}

function task3(callback) {
  asyncOperation(() => {
    callback(); // finally executes the callback
  });
}

task1(() => {
  // executed when task1, task2 and task3 are completed
  console.log('tasks 1, 2 and 3 executed');
});

The above pattern shows that after an asynchronous operation is complete, the next asynchronous operation is invoked. This pattern emphasizes modularity of tasks and avoids the use of closures when working with asynchronous code.

Sequential iteration

The patterns we described earlier are perfect if we know in advance what to perform and how many tasks to perform. This allows us to hardcode the call to the next task in the sequence, but what happens if we want to perform an asynchronous operation on each item in the collection? In this case, we cannot hardcode the sequence of tasks. Instead, we have to build it dynamically.

Web crawler version 2

To show an example of sequential iteration, let’s introduce a new feature for our Web crawler application. We now want to recursively download all the links in the web page. To do this, we’ll extract all the links from the page and trigger our Web crawler application one at a time in sequence.

The first step is to modify our spider() function to trigger a recursive download of all the links on the page by calling a function called spiderLinks().

Also, instead of checking whether the file already exists, we now try to read it and start spidering its links right away; this way, we can resume interrupted downloads. One final change is that we now pass a new argument, nesting, which helps us limit the recursion depth. The resulting code is as follows:

function spider(url, nesting, callback) {
  const filename = utilities.urlToFilename(url);
  fs.readFile(filename, 'utf8', (err, body) => {
    if (err) {
      if (err.code !== 'ENOENT') {
        return callback(err);
      }
      return download(url, filename, (err, body) => {
        if (err) {
          return callback(err);
        }
        spiderLinks(url, body, nesting, callback);
      });
    }
    spiderLinks(url, body, nesting, callback);
  });
}

Crawl links

Now we can create the core of this new version of our Web crawler application, the spiderLinks() function, which uses a sequential asynchronous iterative algorithm to download all the links of the HTML page. Notice the way we define it in the following code block:

function spiderLinks(currentUrl, body, nesting, callback) {
  if (nesting === 0) {
    return process.nextTick(callback);
  }

  let links = utilities.getPageLinks(currentUrl, body); // [1]
  function iterate(index) { // [2]
    if (index === links.length) {
      return callback();
    }

    spider(links[index], nesting - 1, function(err) { // [3]
      if (err) {
        return callback(err);
      }
      iterate(index + 1);
    });
  }
  iterate(0); // [4]
}

The important steps from this new feature are as follows:

  1. We obtain the list of all the links contained in the page using utilities.getPageLinks(). This function returns only the links pointing to the same host name.
  2. We iterate over the links using a local function called iterate(), which takes the index of the next link to analyze. The first thing we do in this function is check whether the index is equal to the length of the links array; if so, the iteration is complete and we immediately invoke callback(), since it means we have processed all the items.
  3. At this point, everything is ready to process the link, so we invoke the spider() function recursively.
  4. As the last and most important step of spiderLinks(), we call iterate(0) to start the iteration.

The algorithm we just proposed allows us to iterate over arrays by performing asynchronous operations sequentially, in our case the spider() function.

We can now try this new version of the web crawler and watch it recursively download all the links of a web page, one after the other. The whole process might take a while if there are many links, so keep in mind that we can always interrupt it with Ctrl + C. If we then decide to resume it, we can do so by launching the application again with the same URL we provided the last time.

Since our web crawler can now potentially trigger the download of an entire website, please consider using it carefully. For example, do not set a high nesting level and do not leave the crawler running for more than a few seconds. It is not polite to overload a server with thousands of requests, and in some cases it may even be considered illegal. Consider the consequences!

The sequential iterator pattern

The code of the spiderLinks() function we just showed is a clear example of how to iterate over a collection while applying an asynchronous operation to each item. We can also notice that this is a pattern that can be adapted to any other situation where we need to iterate asynchronously over the elements of a collection or, more generally, over a list of tasks. The pattern can be generalized as follows:

function iterate(index) {
  if (index === tasks.length) {
    return finish();
  }
  const task = tasks[index];
  task(function() {
    iterate(index + 1);
  });
}

function finish() {
  // Iterate over the completed operation
}

iterate(0);

Note that if task() is a synchronous operation, this type of algorithm becomes truly recursive; in that case, we risk exceeding the maximum size of the call stack.
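
If there is a chance that task() completes synchronously, one possible mitigation (a sketch, not part of the pattern above) is to defer the next step of the iteration so that the stack can unwind:

function iterate(index) {
  if (index === tasks.length) {
    return finish();
  }
  const task = tasks[index];
  task(() => {
    // Deferring the next step with setImmediate() breaks the synchronous
    // recursion chain and keeps the call stack from growing.
    setImmediate(() => iterate(index + 1));
  });
}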

The pattern we just presented is very powerful and can be adapted to several situations. For example, we can map the values of an array, we can pass the result of one iteration to the next to implement a reduce algorithm, we can quit the loop early if a particular condition is met, or we can even iterate over an infinite number of elements.

We could also choose to generalize the solution even further, wrapping it in a function with a signature such as:

iterateSeries(collection, iteratorCallback, finalCallback);

The list of tasks is executed by creating a function, usually called iterator, which invokes the next available task in the collection and makes sure to invoke the next step of the iteration when the current task completes.
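
A possible implementation of such a helper, based on the sequential iterator pattern above, might look like the following sketch; the error-handling choice is just one reasonable option among several:

// Runs iteratorCallback(item, done) on each item of the collection, one at a time.
// finalCallback is invoked once, either with the first error or after the last item.
function iterateSeries(collection, iteratorCallback, finalCallback) {
  function iterate(index) {
    if (index === collection.length) {
      return finalCallback(null);
    }
    iteratorCallback(collection[index], err => {
      if (err) {
        return finalCallback(err);
      }
      iterate(index + 1);
    });
  }
  iterate(0);
}

With a helper like this, the sequential spiderLinks() could be expressed as iterateSeries(links, (link, done) => spider(link, nesting - 1, done), callback).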

Parallel execution

In some cases, the order in which a set of asynchronous tasks are executed doesn’t matter; we just need to be notified when all of these running tasks are complete. Use parallel execution flow to better handle this situation, as shown below:

This may sound strange if we think of Node.js as single-threaded, but if we remember what we discussed in Chapter 1, we realize that even if we only have one thread, we can still achieve concurrency due to the non-blocking nature of Node.js. In fact, the parallel word is used incorrectly in this case, because it does not mean that the tasks are running at the same time, but rather that their execution is performed by the underlying non-blocking API and is interwoven by the event loop.

In Node.js, this kind of flow is possible because tasks are non-blocking: a task runs until it starts a new asynchronous operation or completes, returning control to the event loop so that another task can be executed. The proper name for this kind of flow is concurrency, but we will still use the term parallel for simplicity.

The following figure shows how two asynchronous tasks can be run in parallel in node.js programs:

From the figure above, we have a Main function that performs two asynchronous tasks:

  1. The Main function triggers the execution of Task 1 and Task 2. As these start asynchronous operations, the two functions return immediately, giving control back to Main, which then returns it to the event loop.
  2. When the asynchronous operation of Task 1 completes, the event loop gives control to it. When Task 1 finishes its internal synchronous processing as well, it notifies the Main function.
  3. When the asynchronous operation of Task 2 completes, the event loop gives control to it, and when Task 2 finishes its internal synchronous processing, it notifies the Main function again. At this point, the Main function knows that both Task 1 and Task 2 have completed, so it can continue its execution or return the results of the operations to another callback.

In short, this means that in Node.js we can execute in parallel only asynchronous operations, because their concurrency is handled internally by the non-blocking APIs. Synchronous (blocking) operations cannot run concurrently, unless their execution is interleaved with asynchronous operations, or deferred with setTimeout() or setImmediate(). We'll see this in more detail in Chapter 9.

Web crawler version 3

Our web crawler seems like a perfect candidate for applying parallel asynchronous operations. So far, the application downloads the linked pages recursively, one after the other. Its performance is not optimal, and it is easy to improve it.

To do this, we just need to modify the spiderLinks() function to make sure that all the spider() tasks are started at once, and that the final callback is invoked only when all of them have completed. So let's modify spiderLinks() as follows:

function spiderLinks(currentUrl, body, nesting, callback) {
  if (nesting === 0) {
    return process.nextTick(callback);
  }
  const links = utilities.getPageLinks(currentUrl, body);
  if (links.length === 0) {
    return process.nextTick(callback);
  }
  let completed = 0,
    hasErrors = false;

  function done(err) {
    if (err) {
      hasErrors = true;
      return callback(err);
    }
    if (++completed === links.length && !hasErrors) {
      return callback();
    }
  }
  links.forEach(link => {
    spider(link, nesting - 1, done);
  });
}

What changed in the above code? Now all the spider() tasks are started at once. This is made possible by simply iterating over the links array and starting each task, without waiting for the previous one to finish:

links.forEach(link => {
  spider(link, nesting - 1, done);
});

Then, the trick to make our application wait for all the tasks to complete is to provide the spider() function with a special callback, which we call done(). The done() function increments a counter each time a spider task completes. When the number of completed downloads reaches the size of the links array, the final callback is invoked:

function done(err) {
  if (err) {
    hasErrors = true;
    return callback(err);
  }
  if (++completed === links.length && !hasErrors) {
    callback();
  }
}

With the changes above, if we were to try to run our crawler on a web page now, we would notice a big improvement in the speed of the whole process, since each download is executed in parallel without having to wait for previous links to be processed.

The unlimited parallel execution pattern

For the parallel execution flow too, we can extract our solution into a pattern that can be adapted to different situations and improve code reuse. We can represent a generic version of the pattern with the following code:

const tasks = [ /* ... */ ];
let completed = 0;
tasks.forEach(task => {
  task(() => {
    if (++completed === tasks.length) {
      finish();
    }
  });
});

function finish() {
  // Called when all tasks are completed
}

With small modifications, we can adapt the pattern to accumulate the results of each task into a collection, to filter or map the elements of an array, or to invoke the finish() callback as soon as one task, or a given number of tasks, have completed.

Note: the unlimited parallel execution pattern consists of running a set of asynchronous tasks in parallel by starting them all at once, and then waiting for all of them to complete by counting the number of times their callbacks are invoked.
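
As an example of such a variation, here is a sketch (not from the book) of the pattern adapted to accumulate the result of each task, assuming every task accepts a Node-style callback(err, result):

function parallelMap(tasks, finalCallback) {
  if (tasks.length === 0) {
    return process.nextTick(() => finalCallback(null, []));
  }
  const results = new Array(tasks.length);
  let completed = 0;
  let failed = false;
  tasks.forEach((task, index) => {
    task((err, result) => {
      if (failed) {
        return; // an error was already reported to the final callback
      }
      if (err) {
        failed = true;
        return finalCallback(err);
      }
      results[index] = result; // keep the results in the original order
      if (++completed === tasks.length) {
        finalCallback(null, results);
      }
    });
  });
}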

Fix race conditions with concurrent tasks

Running a set of tasks in parallel can cause problems when using a combination of blocking I/O and multithreading. However, as we just saw, running multiple asynchronous tasks in parallel in Node.js is actually less expensive in terms of resources. This is one of the most important advantages of Node.js, so parallelization is a common practice in Node.js, and it’s not a very complicated technique.

Another important characteristic of the Node.js concurrency model is the way we deal with task synchronization and race conditions. In multithreaded programming, this is usually done using constructs such as locks, mutexes, semaphores, and monitors; it is one of the most complex aspects of parallelization and has a considerable impact on performance. In Node.js, we usually don't need any fancy synchronization mechanism, since everything runs on a single thread! However, this does not mean that we cannot have race conditions; on the contrary, they can be quite common. The root of the problem is the delay between the invocation of an asynchronous operation and the notification of its result. As a concrete example, we can refer again to our web crawler, in particular the last version we created, which actually contains a race condition.

The problem is the spider() function that checks if the file already exists before starting to download the document for the corresponding URL:

function spider(url, nesting, callback) {
  const filename = utilities.urlToFilename(url);
  fs.readFile(filename, 'utf8', function(err, body) {
    if (err) {
      if (err.code !== 'ENOENT') {
        return callback(err);
      }

      return download(url, filename, function(err, body) {
        if (err) {
          return callback(err);
        }
        spiderLinks(url, body, nesting, callback);
      });
    }

    spiderLinks(url, body, nesting, callback);
  });
}

The problem is that two spider tasks operating on the same URL might invoke fs.readFile() on the same file before one of the two has completed its download and created the file, causing both tasks to start a download. This situation is shown in the following figure:

The figure above shows how Task 1 and Task 2 are interleaved in the single thread of Node.js, and how an asynchronous operation can actually introduce a race condition. In our case, the two spider tasks end up downloading the same file. How can we fix that? The answer is much simpler than we might think: all we need is a variable to mutually exclude multiple spider() tasks running on the same URL. This can be achieved with code such as the following:

const spidering = new Map();

function spider(url, nesting, callback) {
  if (spidering.has(url)) {
    return process.nextTick(callback);
  }
  spidering.set(url, true);
  // ...
}

Limited parallel execution

Spawning parallel tasks without control can often lead to excessive load. Imagine having thousands of files to read, URLs to access, or database queries to run in parallel. A common problem in such situations is running out of resources, for example, by using up all the file descriptors available to the application when trying to open too many files at once. In a web application, it may also create a vulnerability that can be exploited by denial-of-service (DoS) attacks. In all these cases, it is a good idea to limit the number of tasks that can run at the same time. This way, we can add some predictability to the load on the server and make sure our application does not run out of resources. The following figure describes a situation where we have five tasks to run in parallel with a concurrency limit of two:

It is clear from the above figure how our algorithm works:

  1. We can perform as many tasks as possible without exceeding the concurrency limit.
  2. Each time a task is complete, we execute one or more tasks, making sure we don’t hit the limit.

Concurrency limit

We now propose a pattern to perform a given set of tasks in parallel with limited concurrency:

const tasks = [ /* ... */ ];
let concurrency = 2, running = 0, completed = 0, index = 0;

function next() {
  while (running < concurrency && index < tasks.length) {
    const task = tasks[index++];
    task(() => {
      completed++, running--;
      if (completed === tasks.length) {
        return finish();
      }
      next();
    });
    running++;
  }
}
next();

function finish() {
  // All tasks completed
}

The algorithm can be thought of as a hybrid between sequential and parallel execution. In fact, we might notice similarities between the two patterns we described earlier:

  1. We have an iterator function, which we call next(), with an inner loop that spawns as many tasks as possible in parallel while staying within the concurrency limit.
  2. The callback we pass to each task checks whether all the tasks in the list have been completed. If there are still tasks to run, it invokes next() to start the next one.

Global concurrency limit

Our web crawler application is a perfect candidate for applying what we have just learned about limiting the concurrency of a set of tasks. In fact, to avoid spidering thousands of links at once, we can enforce a limit on the number of concurrent downloads.

Versions of Node.js before 0.11 already limited the number of concurrent HTTP connections per host to 5. This can, however, be changed to suit our needs; please check the official documentation at nodejs.org/docs/v0.10… for more details about the maxSockets option. Starting with Node.js 0.11, there is no default limit on the number of concurrent connections.
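
For reference, on those older versions the per-host limit could be adjusted through the global HTTP agent; the value below is just an arbitrary example:

// Only relevant on Node.js 0.10 and earlier, where the default per-host limit was 5:
require('http').globalAgent.maxSockets = 10;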

We could apply the pattern we just learned to our spiderLinks() function, but all we would obtain is a limit on the concurrency of the links of a single page. If we chose a concurrency of 2, we would have at most two links downloaded in parallel for each page. However, as we can download more than one link at a time, each page would in turn spawn another two downloads, so the total number of download operations would still grow without a real global limit.

Using the queue

What we really want is to limit the number of global download operations we can run in parallel. We could modify the pattern shown earlier slightly, but we’d rather use it as an exercise because we want to introduce another mechanism that uses queues to limit the concurrency of multiple tasks. Let’s see how this works.

We will now implement a class called TaskQueue that combines queues with the algorithms we mentioned earlier. We create a new module called taskqueue.js:

class TaskQueue {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.running = 0;
    this.queue = [];
  }
  pushTask(task) {
    this.queue.push(task);
    this.next();
  }
  next() {
    while (this.running < this.concurrency && this.queue.length) {
      const task = this.queue.shift();
      task(() => {
        this.running--;
        this.next();
      });
      this.running++;
    }
  }
}

module.exports = TaskQueue;

The constructor of this class takes only the concurrency limit as input and, in addition, initializes the instance variables running and queue. The former is a counter used to keep track of all the running tasks, while the latter is the array that will be used as a queue to store the pending tasks.

The pushTask() method simply adds a new task to the queue and then guides the execution of the task by calling this.next().

The next() method generates a set of tasks from the queue, ensuring that it does not exceed the concurrency limit.

We may notice some similarities between this method and the pattern for limiting concurrency we presented earlier. It basically starts as many tasks from the queue as possible, without exceeding the concurrency limit. When each task completes, it updates the count of running tasks and then starts another round of tasks by invoking next() again. The interesting property of the TaskQueue class is that it allows us to dynamically add new items to the queue. The other advantage is that we now have a central entity responsible for limiting the concurrency of our tasks, which can be shared across all the invocations of a function. In our case, that function is spider(), as we will see in a moment.
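
Before plugging it into the crawler, here is a minimal usage sketch of the class; the tasks below are just placeholders that simulate asynchronous work with setTimeout():

const TaskQueue = require('./taskQueue');
const queue = new TaskQueue(2); // at most 2 tasks running at the same time

for (let i = 0; i < 5; i++) {
  queue.pushTask(done => {
    setTimeout(() => {      // placeholder for a real asynchronous operation
      console.log(`task ${i} completed`);
      done();               // tells the queue it can start the next pending task
    }, 1000);
  });
}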

Web crawler version 4

Now that we have a generic queue to perform tasks in a limited parallel process, we can use it directly in our Web crawler application. We first load the new dependency and create a new instance of the TaskQueue class by setting the concurrency limit to 2:

const TaskQueue = require('./taskQueue');
const downloadQueue = new TaskQueue(2);

Next, we update the spiderLinks() function with the newly created downloadQueue:

function spiderLinks(currentUrl, body, nesting, callback) {
  if (nesting === 0) {
    return process.nextTick(callback);
  }
  const links = utilities.getPageLinks(currentUrl, body);
  if (links.length === 0) {
    return process.nextTick(callback);
  }
  let completed = 0,
    hasErrors = false;
  links.forEach(link => {
    downloadQueue.pushTask(done => {
      spider(link, nesting - 1, err => {
        if (err) {
          hasErrors = true;
          return callback(err);
        }
        if (++completed === links.length && !hasErrors) {
          callback();
        }
        done();
      });
    });
  });
}

This new implementation of the function is very easy to follow, and it is very similar to the unlimited parallel execution algorithm presented earlier in the chapter. This is because we delegate the concurrency control to the TaskQueue object, and the only thing we have to do is check whether all the tasks are complete. The interesting part is how our tasks are defined in the code above:

  • We run the spider() function by providing a custom callback.
  • In the callback, we check whether all the tasks related to this spiderLinks() execution are completed. When this condition is true, we invoke the final callback of the spiderLinks() function.
  • At the end of our task, we invoke the done() callback so that the queue can continue its execution.

After we make these small changes, we can now try to run the Web crawler application again. This time, we should note that there won’t be more than two downloads at the same time.

Async library

If we look back at every control flow pattern we have analyzed so far, we can see that each could be used as a basis for building reusable, more generic solutions. For example, we could wrap the unlimited parallel execution algorithm into a function that accepts a list of tasks, runs them in parallel, and invokes the given callback when all of them are complete. This way of wrapping control flow algorithms into reusable functions can lead to a more declarative and expressive way of defining asynchronous control flows, and that's exactly what async does. The async library is a very popular solution, in Node.js and JavaScript in general, for dealing with asynchronous code. It offers a set of functions that greatly simplify the execution of a set of tasks in different configurations, and it also provides useful helpers for working with collections asynchronously. Even though there are several other libraries with similar goals, async is a de facto standard in Node.js thanks to its popularity.

Sequential execution

The async library can be of great help when implementing complex asynchronous control flows, but the difficulty is choosing the right helper for the problem at hand. For example, for sequential execution alone there are around 20 different functions to choose from, including eachSeries(), mapSeries(), filterSeries(), rejectSeries(), reduce(), reduceRight(), detectSeries(), concatSeries(), series(), whilst(), doWhilst(), until(), doUntil(), forever(), waterfall(), compose(), seq(), applyEachSeries(), iterator(), and timesSeries().

Choosing the right functions is an important step toward writing more solid and readable code, but it also requires some experience and practice. In our example, we’ll cover only a few of these cases, but they will still provide a solid foundation for understanding and using the rest of the library effectively.

Let's see how async works in practice by applying it to our web crawler application. We start directly from version 2, the one that downloads all the links recursively and in sequence.

But first we make sure to install the Async library into our current project:

npm install async

Then we need to load a new dependency from the spider.js module:

const async = require('async');

Sequential execution of a known set of tasks

Let’s start by modifying the download() function. As shown below, it does the following three things in order:

  1. Download the contents of the URL.
  2. Create a new directory if it does not exist yet.
  3. Save the contents of the URL to a file.

The function we can use with async for the sequential execution of a known set of tasks is async.series():

async.series(tasks, [callback])

async.series() takes as arguments a list of tasks and a callback function that is invoked when all the tasks have completed. Each task is just a function that accepts a callback, which must be invoked when the task completes its execution:

function task(callback) {}

The nice thing about async is that it uses the same callback conventions as Node.js, and it automatically handles error propagation. So, if any of the tasks invokes its callback with an error, async will skip the remaining tasks in the list and jump directly to the final callback.
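
A tiny illustration of this behavior, unrelated to the crawler code, follows; the error message is made up for the example:

async.series([
  callback => callback(null, 'first task done'),
  callback => callback(new Error('boom')),          // fails here
  callback => {
    console.log('this task is never started');
    callback(null);
  }
], err => {
  if (err) {
    console.log(`Final callback received: ${err.message}`); // "boom"
  }
});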

With this in mind, let’s look at how to modify the download() function above by using async:

function download(url, filename, callback) {
  console.log(`Downloading ${url}`);
  let body;
  async.series([
    callback => {
      request(url, (err, response, resBody) => {
        if (err) {
          return callback(err);
        }
        body = resBody;
        callback();
      });
    },
    mkdirp.bind(null, path.dirname(filename)),
    callback => {
      fs.writeFile(filename, body, callback);
    }
  ], err => {
    if (err) {
      return callback(err);
    }
    console.log(`Downloaded and saved: ${url}`);
    callback(null, body);
  });
}

Compared to the callback hell version of this code, async lets us organize our asynchronous tasks much better. There is no more nesting of callbacks, as we only had to provide a flat list of tasks, usually one for each asynchronous operation, which async will then execute in sequence:

  1. First, we download the contents of the URL. We save the response body into a closure variable (body) so that it can be shared with the other tasks.
  2. We create and save the directory that will contain the downloaded page. We do this with a partial application of the mkdirp() function, binding the path of the directory to create. This way, we save a few lines of code and increase its readability.
  3. Finally, we write the contents of the downloaded URL to a file. In this case, we could not perform a partial application (as we did for the second task), because the variable body is only available after the download task in the series completes. However, we can still save some lines of code by exploiting async's automatic error management and passing the callback of the task directly to fs.writeFile().
  4. After all the tasks are complete, the final callback of async.series() is invoked. In our case, we simply do some error management and then return the body variable to the callback of the download() function.

As an alternative to async.series() in the code above, we could use async.waterfall(), which still executes the tasks in sequence but also provides the output of each task as input to the next. In our situation, we could use this feature to propagate the body variable until the end of the sequence.
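
For example, a waterfall-based variant of download() might look like the following sketch; this is an alternative illustration, not the version used in the rest of the chapter:

// A sketch of download() rewritten with async.waterfall(): each task's output is passed
// as input to the next one, so the body no longer needs to live in a closure variable.
function download(url, filename, callback) {
  console.log(`Downloading ${url}`);
  async.waterfall([
    callback => {
      request(url, (err, response, body) => callback(err, body));
    },
    (body, callback) => {
      mkdirp(path.dirname(filename), err => callback(err, body));
    },
    (body, callback) => {
      fs.writeFile(filename, body, err => callback(err, body));
    }
  ], (err, body) => {
    if (err) {
      return callback(err);
    }
    console.log(`Downloaded and saved: ${url}`);
    callback(null, body);
  });
}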

Sequential iteration

In the previous paragraphs, we saw how to execute a known set of tasks in sequence using async.series(). We could use the same function to implement the spiderLinks() function of web crawler version 2. However, async provides a more appropriate helper for the specific situation in which we have to iterate over a collection: async.eachSeries(). Let's use it to reimplement our spiderLinks() function (version 2, serial download) as follows:

function spiderLinks(currentUrl, body, nesting, callback) {
  if (nesting === 0) {
    return process.nextTick(callback);
  }
  const links = utilities.getPageLinks(currentUrl, body);
  if (links.length === 0) {
    return process.nextTick(callback);
  }
  async.eachSeries(links, (link, callback) => {
    spider(link, nesting - 1, callback);
  }, callback);
}

If we compare the above code using Async with code that implements the same functionality using a pure JavaScript pattern, we’ll notice that Async gives us a huge advantage in terms of code organization and readability.

Parallel execution

The async library does not lack functions for handling parallel flows either; among them we can find each(), map(), filter(), reject(), detect(), some(), every(), concat(), parallel(), applyEach(), and times(). They follow the same logic as the functions we have already seen for sequential execution, with the difference that the tasks provided are executed in parallel.

To demonstrate this, we can try implementing version 3 of our Web crawler application with one of the above features, which uses an unlimited parallel process to perform the download.

If we remember the code we used earlier to implement the sequential version of the spiderLinks() function, it is relatively easy to adjust it to work in parallel:

function spiderLinks(currentUrl, body, nesting, callback) {
  // ...
  async.each(links, (link, callback) => {
    spider(link, nesting - 1, callback);
  }, callback);
}

This function is exactly the same as we used for sequential downloads, but uses async.each() instead of async.eachSeries(). This clearly demonstrates the power of abstracting asynchronous flows using libraries such as Async. The code is no longer tied to a specific execution flow; there is no code written specifically for that. Most of it is just application logic.

Limiting parallel execution

If you are wondering whether async can also be used to limit the concurrency of parallel tasks, the answer is yes, it can. There are a few functions we can use for that, namely eachLimit(), mapLimit(), parallelLimit(), queue(), and cargo().
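
For instance, async.eachLimit() works like async.each() but runs at most a given number of tasks at a time. A quick sketch applied to the links of a single page would look like this; note that, as discussed earlier, such a limit would be local to each spiderLinks() call rather than global, which is why we use a queue below:

async.eachLimit(links, 2, (link, callback) => {
  spider(link, nesting - 1, callback);
}, callback);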

Let's try to use one of them to implement version 4 of our web crawler application, the one executing the link downloads in parallel with a limited concurrency. Fortunately, async has async.queue(), which works in a way similar to the TaskQueue class we created earlier in the chapter. The async.queue() function creates a new queue, which uses a worker() function to execute a set of tasks with a specified concurrency limit:

const q = async.queue(worker, concurrency);

The worker() function takes the task to run as input and a callback function as arguments, and executes the callback when the task completes:

function worker(task, callback)

It is worth noting that a task here can be anything, not just a function; in fact, the worker is in charge of handling the task in the most appropriate way. New tasks are added to the queue using q.push(task, callback). The callback associated with a task will be invoked by the worker after the task has been processed.

Now, let’s modify our code again to implement a fully parallel execution flow with concurrent limitations, using async.queue(). First, we need to create a queue:

const downloadQueue = async.queue((taskData, callback) => {
  spider(taskData.link, taskData.nesting - 1, callback);
}, 2);

The code is simple. We are creating a new queue with a concurrency limit of 2 that lets a worker call our spider() function with only the data associated with the task. Next, we implement the spiderLinks() function:

function spiderLinks(currentUrl, body, nesting, callback) {
  if (nesting === 0) {
    return process.nextTick(callback);
  }
  const links = utilities.getPageLinks(currentUrl, body);
  if (links.length === 0) {
    return process.nextTick(callback);
  }
  let completed = 0,
    hasErrors = false;
  links.forEach(function(link) {
    const taskData = {
      link: link,
      nesting: nesting
    };
    downloadQueue.push(taskData, err => {
      if (err) {
        hasErrors = true;
        return callback(err);
      }
      if (++completed === links.length && !hasErrors) {
        callback();
      }
    });
  });
}

The previous code should look very familiar, as it is almost identical to the version implementing the same flow using the TaskQueue object. The important part to analyze here is where a new task is pushed into the queue. At that point, we make sure to pass a callback that allows us to check whether all the download tasks for the current page are completed, and that eventually invokes the final callback.

Thanks to async.queue(), we could easily replicate the functionality of our TaskQueue object, which again demonstrates that with async we can avoid writing asynchronous control flow patterns from scratch, reducing our effort and keeping the code leaner.

Summary

At the beginning of this chapter, we said that Node.js programming can be tough because of its asynchronous nature, especially for people used to developing on other platforms. In this chapter, however, we saw how asynchronous APIs can be tamed starting from plain JavaScript, which laid the foundation for analyzing more sophisticated techniques. We then saw that, besides offering a programming style for every taste, the tools at our disposal are indeed varied and provide good solutions to most of our problems. For example, we can use the async library to simplify the most common flows.

There are also more advanced techniques, such as promises and generators, which will be the focus of the next chapter. Once you are familiar with all of these techniques, you will be able to choose the best solution for your needs, or use several of them together in the same project.