Most of the images in this article are from the Internet

Preface

On December 9, 2021, Vercel published an official blog post titled "Vercel Acquires Turborepo to Accelerate Build Speed and Improve Developer Experience". As the title says, Vercel acquired Turborepo to speed up builds and improve the development experience.

Turborepo is a high-performance build system for JavaScript and TypeScript codebases. With incremental builds, intelligent remote caching, and optimized task scheduling, Turborepo can speed up builds by 85% or more, enabling teams of all sizes to maintain a fast and efficient build system that scales as the codebase and the team grow.

That blog post summarizes Turborepo's strengths succinctly. This article starts from real-world scenarios, looks at some of the problems that large code repositories (Monorepos) can run into, reviews the industry's existing solutions, and then examines what Turborepo has innovated in task scheduling.

What a qualified Monorepo needs

As the business develops and the team changes, the number of projects in a business Monorepo grows steadily. The extreme example is Google, which keeps the entire company's code in a single repository of up to 80TB.

Business Monorepo: unlike a library Monorepo (such as the packages of React, Vue 3, Next.js, and Babel), a business Monorepo organizes multiple business applications, together with the shared component libraries and utility libraries they depend on, in a single repository. (From "Eden Monorepo Series: Brief Analysis of Eden Monorepo Engineering Construction")

More projects mean that the advantages of a Monorepo come with significant challenges. A good Monorepo tool lets developers enjoy those advantages without any burden, while a poor one makes developers miserable and can even make them wonder why Monorepos exist at all.

Here are some practical scenarios I encountered:

  1. Dependency version conflicts
    1. A newly created project cannot start because of dependency issues
    2. After a new project is created, other projects can no longer start because of dependency issues
  2. Slow dependency installation
    1. The initial installation takes 20+ minutes
    2. Adding a single dependency takes 3+ minutes
  3. Tasks such as build, test, and lint run slowly

I previously worked on rolling out Rush. In practice, I found that beyond the basic ability to share code, a Monorepo tool needs at least three capabilities:

  1. Dependency management. As the number of dependencies grows, the dependency structure must stay correct and stable, and installation must stay efficient.
  2. Task scheduling. Tasks in Monorepo projects (narrowly defined as NPM scripts such as build, test, and lint) must run as efficiently as possible and in the correct order, and that efficiency must not degrade as the Monorepo grows more complex.
  3. Release management. Version bumps, CHANGELOG generation, and project releases must be carried out correctly based on the changed projects and the dependencies between projects.

The support capabilities of some popular tools are shown in the following table:

                  Dependency management   Task scheduling   Version management
Pnpm Workspace
Rush              ✅ (by Pnpm)
Lage
Turborepo
Lerna

  1. Pnpm: Pnpm has some task orchestration ability (the --filter parameter), so it is included here. As a package manager, it is also an integral part of any large Monorepo.
  2. Rush: An extensible Monorepo management solution open-sourced by Microsoft, with built-in Pnpm support and a Changesets-like publishing scheme. Its plug-in mechanism is a highlight: it makes it very convenient to implement custom functionality on top of Rush's built-in capabilities, and it is the first step toward a Rush plugin ecosystem.
  3. Lage: Also open-sourced by Microsoft. I personally consider it the predecessor of Turborepo, which is in effect a Go version of Lage. Lage calls itself a "Monorepo Task Runner", which is much more modest than Turborepo's "High-performance Build System", and its star count is an order of magnitude lower (300+ for Lage, 5K+ for Turborepo); see the PR for more. For the purposes of this article, Lage and Turborepo are treated as equivalent.
  4. Lerna: Maintenance has stopped, so it is not included in the discussion below.

Dependency management is too low-level, and version management is simple and mature, so it is hard to push either capability much further. In practice, these two are usually covered by combining Pnpm and Changesets, and some tools specialize purely in task orchestration, which is where Lage and Turborepo come in.

How to choose the right Monorepo tool chain?

  1. Pnpm Workspace + Changesets: Low cost, suitable for most scenarios
  2. Pnpm Workspace + Changesets + Turborepo/Lage: Option 1 with enhanced task orchestration
  3. Rush: Comprehensive and extensible

Task orchestration can be divided into three steps, which are supported by the following tools:

                  Scoping   Parallel execution   Cloud cache
Pnpm
Rush
Turborepo/Lage

Scoping: Run a subset of tasks as needed

This capability has rich usage scenarios in daily development.

For example, to start project app1 right after pulling the repository for the first time, we need to build app1's pre-dependencies inside the Monorepo, package1 and package2.

To package project app1 on the SCM, we need to build app1 itself as well as its pre-dependencies inside the Monorepo, package1 and package2.

In both cases, only the projects that actually need building should be selected, rather than also building projects that are irrelevant to the current goal.

In different Monorepo tools, this behavior is called differently:

  1. Rush calls it "Selecting subsets of projects". In this example, the following commands should be used:

      # Start app1 in development mode locally: app1 is at the top of the
      # dependency graph, and app1 itself does not need to be built
      $ rush build --to-except @monorepo/app1

      # Package app1 on the SCM: app1 is at the top of the dependency graph,
      # and @monorepo/app1 itself needs to be built
      $ rush build --to @monorepo/app1

  2. Pnpm calls it "Filtering", which restricts commands to a specific subset of packages. In this example, the following commands should be used:

      # Start app1 in development mode locally: app1 is at the top of the
      # dependency graph, and app1 itself does not need to be built
      $ pnpm build --filter @monorepo/app1^...

      # Package app1 on the SCM: app1 is at the top of the dependency graph,
      # and @monorepo/app1 itself needs to be built
      $ pnpm build --filter @monorepo/app1...

  3. Turborepo/Lage calls it "Scoped Tasks", but as of 2022/02/13 this capability is quite limited. The Vercel team is designing a filter syntax that is basically consistent with Pnpm's. See the RFC for details: New Task Filtering Syntax

Scoping ensures that the number of tasks does not grow as unrelated projects are added to the Monorepo, and a rich set of parameters helps us select/filter/scope in various scenarios (publishing packages, building apps, and running CI tasks).

For example, if package5 is modified, you can run the following command in the CI environment of Merge Request to ensure that package5 and projects dependent on package5 will not fail to build due to the modification:

    $ pnpm build --filter ...@monorepo/package5

In this example, package5 and app3 are ultimately selected for the build, which satisfies the minimum requirement for merging code in CI: the change does not break the builds of other projects.

Scoping is based on the package.json files of all projects in the workspace. Each project knows its upstream projects (dependents) and its downstream projects (dependencies), and these are matched against the parameters passed in by the developer, so selecting a subset of projects is straightforward.
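The selection logic described above can be sketched roughly as follows. This is only an illustration, not how Rush or Pnpm implement it: the workspace layout and the function names (reachable, invert, withDependencies, dependenciesOnly, withDependents) are assumptions made for this example, and a real tool would build the graph from each project's package.json.

```typescript
// Hypothetical workspace: project name -> workspace projects it depends on.
type DependencyGraph = Record<string, string[]>;

const workspace: DependencyGraph = {
  app1: ["package1", "package2"],
  app3: ["package5"],
  package1: [],
  package2: [],
  package5: [],
};

// Collect every project reachable from `start` by following `edges`.
function reachable(start: string, edges: DependencyGraph): string[] {
  const seen: string[] = [];
  const stack: string[] = [start];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (seen.indexOf(current) !== -1) {
      continue;
    }
    seen.push(current);
    (edges[current] || []).forEach((next) => {
      stack.push(next);
    });
  }
  return seen.sort();
}

// Invert the graph so that each project maps to its dependents.
function invert(graph: DependencyGraph): DependencyGraph {
  const inverted: DependencyGraph = {};
  Object.keys(graph).forEach((name) => {
    inverted[name] = [];
  });
  Object.keys(graph).forEach((name) => {
    (graph[name] || []).forEach((dep) => {
      const list = inverted[dep] || [];
      list.push(name);
      inverted[dep] = list;
    });
  });
  return inverted;
}

// The project and everything it depends on (comparable to "app1..." in Pnpm).
function withDependencies(name: string, graph: DependencyGraph): string[] {
  return reachable(name, graph);
}

// Only the dependencies, without the project itself (comparable to "app1^...").
function dependenciesOnly(name: string, graph: DependencyGraph): string[] {
  return reachable(name, graph).filter((project) => project !== name);
}

// The project and everything that depends on it (comparable to "...package5").
function withDependents(name: string, graph: DependencyGraph): string[] {
  return reachable(name, invert(graph));
}
```

With this layout, withDependents("package5", workspace) selects package5 and app3, which matches the CI example above.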

Parallel execution: Make full use of machine performance

Suppose scoping has selected a subset of 20 tasks. How should these 20 tasks be executed to guarantee both correctness and efficiency?

Projects depend on one another, so tasks depend on one another as well. Taking the build task as an example, a project can be built only after its pre-dependencies have been built.

This resembles a common interview question: given m URLs and a maximum of n parallel requests at any time, implement code that keeps the number of in-flight requests at that maximum.

The idea behind this question is similar to parallel task execution in task orchestration. The difference is that the URLs in the interview question do not depend on one another, whereas tasks must respect a topological order.
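A minimal sketch of the interview question, assuming each request is wrapped in a function that returns a promise (for example () => fetch(url)). The name runWithLimit and its parameters are made up for this illustration.

```typescript
// Run `jobs` with at most `limit` of them in flight at any time.
// Results are returned in the same order as the jobs.
function runWithLimit<T>(
  jobs: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  return new Promise<T[]>((resolve, reject) => {
    const results: T[] = new Array<T>(jobs.length);
    let nextIndex = 0; // index of the next job to start
    let finished = 0; // number of jobs that have completed

    if (jobs.length === 0) {
      resolve(results);
      return;
    }

    const startNext = (): void => {
      if (nextIndex >= jobs.length) {
        return;
      }
      const index = nextIndex++;
      const job = jobs[index];
      if (!job) {
        return;
      }
      job().then((value) => {
        results[index] = value;
        finished++;
        if (finished === jobs.length) {
          resolve(results);
        } else {
          // A slot has just been freed: fill it immediately.
          startNext();
        }
      }, reject);
    };

    // Fill the initial slots.
    for (let i = 0; i < Math.min(limit, jobs.length); i++) {
      startNext();
    }
  });
}
```

Because a new job starts as soon as any running job finishes, the number of in-flight jobs stays at the limit until the queue runs dry. Task orchestration adds one more constraint on top of this: a job may only start once its dependencies are done.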

Then the execution of the task is clear:

  1. The initially executable tasks are those without any prerequisite tasks
    • Their number of Dependencies is 0
  2. When a task finishes, the scheduler looks for the next executable task in the task queue and runs it immediately
    • For every task that depends on the finished task, remove the finished task from its Dependencies (its dependency count decreases by 1)
    • A task can be executed once all of its Dependencies have finished (its dependency count is 0)

This article does not go into the code in detail. For a concrete implementation, see the article "The task scheduling mechanism in Monorepo", which implements parallel task execution in topological order at the code level.
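Still, as a rough illustration of the two steps listed above, here is a minimal sketch. It is not the implementation used by Rush, Lage, or Turborepo; TaskGraph, runInTopologicalOrder, and the limit parameter are names invented for this example.

```typescript
// Hypothetical task graph: task name -> tasks that must finish first.
type TaskGraph = Record<string, string[]>;

// Run every task in topological order with at most `limit` tasks in parallel.
// `run` performs one task (for example, spawning an NPM script).
// Resolves with the order in which tasks were started.
function runInTopologicalOrder(
  graph: TaskGraph,
  run: (task: string) => Promise<void>,
  limit: number
): Promise<string[]> {
  return new Promise<string[]>((resolve, reject) => {
    const names = Object.keys(graph);
    const pending: Record<string, number> = {}; // remaining dependency count
    const dependents: Record<string, string[]> = {};
    names.forEach((name) => {
      pending[name] = (graph[name] || []).length;
      dependents[name] = [];
    });
    names.forEach((name) => {
      (graph[name] || []).forEach((dep) => {
        const list = dependents[dep];
        if (!list) {
          // Throwing inside the executor rejects the returned promise.
          throw new Error("Unknown dependency: " + dep);
        }
        list.push(name);
      });
    });

    // Step 1: tasks whose dependency count is 0 can start right away.
    const ready = names.filter((name) => pending[name] === 0);
    const started: string[] = [];
    let running = 0;
    let done = 0;

    const schedule = (): void => {
      if (done === names.length) {
        resolve(started);
        return;
      }
      if (running === 0 && ready.length === 0) {
        reject(new Error("Dependency cycle detected"));
        return;
      }
      while (running < limit && ready.length > 0) {
        const task = ready.shift() as string;
        started.push(task);
        running++;
        run(task).then(() => {
          running--;
          done++;
          // Step 2: decrement the count of each dependent task and
          // queue those whose dependencies have all finished.
          (dependents[task] || []).forEach((next) => {
            pending[next] = (pending[next] || 0) - 1;
            if (pending[next] === 0) {
              ready.push(next);
            }
          });
          schedule();
        }, reject);
      }
    };

    schedule();
  });
}
```

The sketch stops at the first failing task and treats a state with nothing running and nothing ready as a dependency cycle; a production scheduler would need more careful error handling and reporting.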

Breaking task boundaries

This image is from Turborepo: Pipelining Package Tasks

So far, task execution has been discussed within a single type of task, such as build, lint, or test: when running build tasks in parallel, lint and test tasks were not considered. As the Lerna area in the figure above shows, the four task types run one after another, each blocked by the previous one. Even though execution within each type is parallel, resources are still wasted between task types.

Lage/Turborepo gives developers a way to declare the relationships between tasks explicitly (see turbo.json below), which allows Lage/Turborepo to schedule and optimize across different types of tasks.

Running tasks as an overlapping waterfall is much more efficient than running only one type of task at a time.

turbo.json

{
  "$schema": "https://turborepo.org/schema.json",
  "pipeline": {
    "build": {
      // A project's build runs after the build commands of its dependencies complete
      "dependsOn": ["^build"]
    },
    "test": {
      // A project's test runs after its own build command completes
      "dependsOn": ["build"]
    },
    "deploy": {
      // A project's deploy runs after its own build, test, and lint commands complete
      "dependsOn": ["build", "test", "lint"]
    },
    // lint has no prerequisites and can start at any time
    "lint": {}
  }
}

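To make the dependsOn semantics concrete, the sketch below expands a pipeline into per-project task nodes: an entry prefixed with ^ points at the same task in each dependency of the project, while a plain entry points at a task in the same project. This is only an illustration and not Turborepo's implementation; the types, the function name expandPipeline, and the "project#task" node naming are assumptions made for this example.

```typescript
type Pipeline = Record<string, { dependsOn?: string[] }>;
// Hypothetical project graph: project -> workspace projects it depends on.
type PackageGraph = Record<string, string[]>;

// Expand a pipeline into "project#task" nodes and their prerequisite nodes.
function expandPipeline(
  pipeline: Pipeline,
  packages: PackageGraph
): Record<string, string[]> {
  const tasks: Record<string, string[]> = {};
  Object.keys(packages).forEach((pkg) => {
    Object.keys(pipeline).forEach((task) => {
      const entry = pipeline[task];
      const dependsOn = (entry && entry.dependsOn) || [];
      const prerequisites: string[] = [];
      dependsOn.forEach((dep) => {
        if (dep.charAt(0) === "^") {
          // "^build": the build task of every project this one depends on
          (packages[pkg] || []).forEach((dependency) => {
            prerequisites.push(dependency + "#" + dep.slice(1));
          });
        } else {
          // "build": the build task of the same project
          prerequisites.push(pkg + "#" + dep);
        }
      });
      tasks[pkg + "#" + task] = prerequisites;
    });
  });
  return tasks;
}
```

The result is a single task graph that mixes build, test, and lint nodes, so a scheduler can start app1#lint while package1#build is still running instead of waiting for an entire task type to finish.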
Correct order

Rush also held design discussions on this capability in March and October, and supported a similar feature by the end of 2021. For details, see [Rush] Add Support for Phased Commands

  • Turborepo: Pipelining Package Tasks
  • How does lage work?
  • [rush] Design proposal: “phased” custom commands #2300

Cloud cache: Reuse caches across multiple environments

Rush has an incremental build feature that allows rush build to skip projects whose input files have not changed since the last build. Combined with a third-party storage service, the build cache can be reused across multiple environments.

Rush introduced a plug-in mechanism in version 5.57.0 that opens up remote caching to third-party providers (previously only Azure and Amazon storage were supported), giving developers the ability to implement a build cache on top of internal enterprise services.

In everyday development, local development, CI, and SCM builds all benefit from this.

As mentioned above, building the changed projects together with their upstream and downstream projects in CI guarantees the quality of a Merge Request to some extent.

As shown in the figure above, suppose the code of package0 has been modified. To make sure that its upstream and downstream projects still build, the following command is executed in the "Build Changed Projects" stage of CI:

$ rush build --to package0 --from package0

The changed projects are selected from the source file changes reported by git diff; in this case, that is package0.

After scoping, package0 and its upstream project app1 are included in the build. Because app1 needs to be built, package1 through package5 must also be built as its pre-dependencies. These five packages, however, neither depend on package0 nor contain any changes; they are built only to prepare for building app1.

When the dependency graph is complex, for example when a base package is referenced by many applications, the amount of such preparatory building (package1 through package5 here) grows sharply and makes this CI stage very slow.

Number of projects actually built = number of downstream projects of the changed project + number of upstream projects of the changed project + number of downstream projects of those upstream projects

Because package1 through package5 neither directly nor indirectly depend on package0, and their input files have not changed, they can hit the cache (if one exists) and skip the build.

This reduces the build scope from 7 projects to 2 projects.

Number of projects actually built = number of downstream projects of the changed project + number of upstream projects of the changed project
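The package0 example can be worked through with a small sketch. It assumes the layout described above (app1 depends on package0 through package5) and that every unchanged project already has a cache entry; ProjectGraph, exampleGraph, closure, invertGraph, and planBuild are names invented for this illustration.

```typescript
// Hypothetical project graph: project -> workspace projects it depends on.
type ProjectGraph = Record<string, string[]>;

const exampleGraph: ProjectGraph = {
  app1: ["package0", "package1", "package2", "package3", "package4", "package5"],
  package0: [],
  package1: [],
  package2: [],
  package3: [],
  package4: [],
  package5: [],
};

// Map each project to the projects that depend on it.
function invertGraph(graph: ProjectGraph): ProjectGraph {
  const dependents: ProjectGraph = {};
  Object.keys(graph).forEach((name) => {
    dependents[name] = [];
  });
  Object.keys(graph).forEach((name) => {
    (graph[name] || []).forEach((dep) => {
      const list = dependents[dep] || [];
      list.push(name);
      dependents[dep] = list;
    });
  });
  return dependents;
}

// All projects reachable from `roots` along `edges`, including the roots.
function closure(roots: string[], edges: ProjectGraph): string[] {
  const seen: string[] = [];
  const stack = roots.slice();
  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (seen.indexOf(current) !== -1) {
      continue;
    }
    seen.push(current);
    (edges[current] || []).forEach((next) => {
      stack.push(next);
    });
  }
  return seen.sort();
}

function planBuild(changed: string[], graph: ProjectGraph) {
  // Changed projects plus everything that depends on them: inputs changed.
  const rebuilt = closure(changed, invertGraph(graph));
  // Everything the build has to visit, including pre-dependencies.
  const selected = closure(rebuilt, graph);
  // Visited projects whose inputs did not change can be restored from cache.
  const cacheHits = selected.filter((name) => rebuilt.indexOf(name) === -1);
  return { selected, rebuilt, cacheHits };
}
```

For planBuild(["package0"], exampleGraph), selected contains all 7 projects, rebuilt contains app1 and package0, and cacheHits contains package1 through package5, which is the reduction from 7 projects to 2 described above.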

How do I determine if the cache is hit?

In the cloud, the compressed build cache of each project is stored under a cacheId computed from the project's input files. If the input files do not change, the computed cacheId does not change either (content hashing), and the build hits the cloud cache.

The input file contains the following:

  1. The project's source code files
  2. The project's NPM dependencies
  3. The cacheIds of the other internal Monorepo projects that the project depends on

If you are interested in the implementation, check out @rushstack/package-deps-hash.
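As a rough illustration of the idea (and not the algorithm used by @rushstack/package-deps-hash), the sketch below derives a cacheId from the three kinds of input listed above. The ProjectInputs shape and the function names are invented for this example, and a toy 32-bit FNV-1a hash stands in for the cryptographic hash a real tool would use.

```typescript
// Toy 32-bit FNV-1a hash over UTF-16 code units. It keeps the sketch
// self-contained; it is NOT suitable for a real cache key.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Hypothetical description of one project's inputs.
interface ProjectInputs {
  sourceFiles: Record<string, string>; // path -> file content
  npmDependencies: Record<string, string>; // package name -> resolved version
  internalDependencies: string[]; // other projects in the Monorepo
}

// Compute a cacheId from source files, NPM dependencies, and the cacheIds
// of internal dependencies. Assumes the dependency graph has no cycles.
function computeCacheId(
  name: string,
  projects: Record<string, ProjectInputs>
): string {
  const project = projects[name];
  if (!project) {
    throw new Error("Unknown project: " + name);
  }
  const parts: string[] = [];
  // Keys are sorted so that the result does not depend on insertion order.
  Object.keys(project.sourceFiles).sort().forEach((path) => {
    parts.push("src:" + path + "=" + fnv1a(project.sourceFiles[path] || ""));
  });
  Object.keys(project.npmDependencies).sort().forEach((dep) => {
    parts.push("npm:" + dep + "@" + (project.npmDependencies[dep] || ""));
  });
  project.internalDependencies.slice().sort().forEach((dep) => {
    parts.push("internal:" + dep + "=" + computeCacheId(dep, projects));
  });
  return fnv1a(parts.join("\n"));
}
```

Because the cacheId of a project includes the cacheIds of its internal dependencies, a change in package0 changes the cacheId of app1 as well, while projects that do not depend on package0 keep their cacheId and hit the cache.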

Conclusion

While writing this article, I was also reminded of the three approaches to faster builds that @sorrycc mentioned in the GMTC talk "Systematized Way to Speed Up Front-end Builds":

  1. Deferred processing. On-demand compilation based on requests, deferred sourcemap compilation
  2. Caching. Vite Optimize, Webpack 5 physical cache, Babel cache
  3. Native code. SWC, ESBuild

The advantage of native code is not obvious for task orchestration (Turborepo is written in Go, but Lage's author believes that, at the current scale, the bottleneck of task orchestration is not the orchestrator itself), but the ideas behind deferred processing and caching carry over.

I conclude this article with a concise and pragmatic subtitle on the Lage website:

Run all your npm scripts in topological order incrementally with cloud cache – @microsoft/lage

References

  • monorepo.tools: Your defacto #guide on #monorepos.
  • Rush: a scalable monorepo manager for the web
  • Lage: A Beautiful JS Monorepo Task Runner
  • JS Monorepo Workspace Tools
  • Pnpm-Filtering