Preface:

TypeScript (TS) is widespread enough that most of us use it, or at least touch it, in daily development. We tend to focus on its usage, such as how to declare complex data-structure types elegantly and efficiently. However, we rarely look underneath at how TS works, that is, how TS compiles. Online material introducing this topic is also scarce, so this article looks at TS compilation from the level of principles. It is only a stepping stone: to understand the machinery deeply and thoroughly, you still need to refer to the TS source code [github.com/microsoft/T…] and study further.

I. Overview

TS compilation can be broken down into several major steps:

  • Scanner (scanner.ts)
  • Parser (parser.ts)
  • Binder (binder.ts)
  • Checker (checker.ts)
  • Emitter (emitter.ts)

The names in parentheses are the corresponding file names in the source code. How do these steps fit together? A flow chart makes it clearer!

As you can see, the compilation process is roughly divided into two pipelines that achieve two goals.

Two goals:

  • Type checking
  • Compiling to JS code

Two corresponding processes:

  • Source code -> scanner -> token stream -> parser -> AST (abstract syntax tree) -> binder -> symbols; then AST + symbols -> checker -> type checking
  • AST -> checker + emitter -> JS code

🤩 That looks like quite a few steps. What is each one? What does each term stand for? How does it work? Let's talk about them in the following sections.
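Before diving into each step, here is the shortest possible end-to-end demonstration of the second pipeline, as a sketch. It assumes the official `typescript` npm package is installed; `ts.transpileModule` drives the scanner, parser, and emitter in a single call (note that this mode skips type checking):

```typescript
import * as ts from "typescript";

// ts.transpileModule runs scanner -> parser -> emitter in one call.
// No type checking is performed in this mode, which is why it is fast.
const result = ts.transpileModule("const n: number = 1;", {
    compilerOptions: { target: ts.ScriptTarget.ES5 },
});

console.log(result.outputText); // var n = 1;
```

To run the full pipeline including type checking, you would use `ts.createProgram` and ask the program for diagnostics instead.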

II. Process description

2.1 The scanner

As the flowchart shows, the purpose of the scanner is to generate a token stream. What is a token stream? Simply put, it is a stream of lexical units. We know that JS compilation is divided into three steps: lexical analysis (token stream), abstract syntax tree (AST), and conversion to executable code. An example 🌰:

var a = 2;
// The assignment statement above is decomposed into the following lexical units:
// var, a, =, 2, ; (these lexical units form the token stream array)
// Result of lexical analysis:
// [
//     "var" : "keyword",
//     "a"   : "identifier",
//     "="   : "assignment",
//     "2"   : "integer",
//     ";"   : "eos" (end of statement)
// ]
// This array is the token array. The token stream is then further converted into an AST,
// a tree of nested nodes that represents the syntax structure of the program:
// { operation: "=", left: { keyword: "var", right: "a" }, right: "2" }
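To make lexical analysis concrete, here is a minimal hand-written tokenizer. It is purely illustrative (the real scanner in scanner.ts works on character codes and handles far more), but it produces exactly the lexical units listed above for `var a = 2;`:

```typescript
// A minimal, illustrative tokenizer; NOT the real TypeScript scanner.
type Token = { kind: string; text: string };

function tokenize(source: string): Token[] {
    const tokens: Token[] = [];
    let pos = 0;
    while (pos < source.length) {
        const ch = source[pos];
        if (/\s/.test(ch)) { pos++; continue; }              // skip whitespace (trivia)
        if (/[0-9]/.test(ch)) {                              // integer literal
            let text = "";
            while (pos < source.length && /[0-9]/.test(source[pos])) text += source[pos++];
            tokens.push({ kind: "integer", text });
        } else if (/[A-Za-z_$]/.test(ch)) {                  // identifier or keyword
            let text = "";
            while (pos < source.length && /[A-Za-z0-9_$]/.test(source[pos])) text += source[pos++];
            tokens.push({ kind: text === "var" ? "keyword" : "identifier", text });
        } else if (ch === "=") {
            tokens.push({ kind: "assignment", text: "=" });
            pos++;
        } else if (ch === ";") {
            tokens.push({ kind: "eos", text: ";" });         // end of statement
            pos++;
        } else {
            throw new Error(`Unexpected character: ${ch}`);
        }
    }
    return tokens;
}

console.log(tokenize("var a = 2;").map(t => `${t.text}:${t.kind}`).join(", "));
// var:keyword, a:identifier, =:assignment, 2:integer, ;:eos
```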

Now we know that the scanner is what generates that token array. How do we know? There is evidence in the source:

export function createScanner(
    languageVersion: ScriptTarget,
    skipTrivia: boolean,
    languageVariant = LanguageVariant.Standard,
    text?: string,
    onError?: ErrorCallback,
    start?: number,
    length?: number): Scanner {
    let pos: number;
    let end: number;
    let startPos: number;
    let tokenPos: number;
    let token: SyntaxKind;
    // ...
    return {
        getStartPos: () => startPos,
        getTextPos: () => pos,
        getToken: () => token,
        getTokenPos: () => tokenPos,
        // ...
        scan,
        // ...
    };
}

createScanner creates a scanner; the actual scanning logic lives in the scan function:

function scan(): SyntaxKind {
    startPos = pos;
    hasExtendedUnicodeEscape = false;
    precedingLineBreak = false;
    tokenIsUnterminated = false;
    numericLiteralFlags = 0;
    while (true) {
        tokenPos = pos;
        if (pos >= end) {
            return token = SyntaxKind.EndOfFileToken;
        }
        let ch = text.charCodeAt(pos);

        // Special handling for shebang
        if (ch === CharacterCodes.hash && pos === 0 && isShebangTrivia(text, pos)) {
            pos = scanShebangTrivia(text, pos);
            if (skipTrivia) {
                continue;
            }
            else {
                return token = SyntaxKind.ShebangTrivia;
            }
        }
        switch (ch) {
            // ...

We can see that the scan function returns a SyntaxKind, which is the enum type that defines every kind of lexical token:

  // token > SyntaxKind.Identifier => token is a keyword
  // Also, If you add a new SyntaxKind be sure to keep the `Markers` section at the bottom in sync
  export const enum SyntaxKind {
      Unknown,
      EndOfFileToken,
      SingleLineCommentTrivia,
      MultiLineCommentTrivia,
      NewLineTrivia,
      WhitespaceTrivia,
      // We detect and preserve #! on the first line
      ShebangTrivia,
      // We detect and provide better error recovery when we encounter a git merge marker. This
      // allows us to edit files with git-conflict markers in them in a much more pleasant manner.
      ConflictMarkerTrivia,
      // Literals
      NumericLiteral,
      StringLiteral,
      JsxText,
      JsxTextAllWhiteSpaces,
      RegularExpressionLiteral,
      NoSubstitutionTemplateLiteral,
      // Pseudo-literals
      TemplateHead,
      TemplateMiddle,
      TemplateTail,
      // Punctuation
      OpenBraceToken,
      ReturnKeyword,
      SuperKeyword,
      SwitchKeyword,
      // ...
  }


From the above, through lexical analysis of the input source code, the scanner obtains the corresponding SyntaxKind values, i.e. the "tokens". We can write our own example to generate a token stream:

// The ntypescript library exposes many internal TS APIs, so we can use the methods from the source code
import * as ts from 'ntypescript';

// Create the scanner
const scanner = ts.createScanner(ts.ScriptTarget.Latest, true);

// Initialize the scanner
function initializeState(text: string) {
    scanner.setText(text);
    scanner.setScriptTarget(ts.ScriptTarget.ES5);
    scanner.setLanguageVariant(ts.LanguageVariant.Standard);
}

const str = 'const a = 1;';

initializeState(str);

var token = scanner.scan();
// Call scan to obtain a token. As long as the token is not the end-of-file token,
// the scanner keeps scanning the input string
while (token !== ts.SyntaxKind.EndOfFileToken) {
    console.log(token);
    console.log(ts.formatSyntaxKind(token));
    token = scanner.scan();
}

The following output is displayed:

76              // SyntaxKind[76], ConstKeyword
ConstKeyword    // corresponds to const
71
Identifier      // corresponds to a
58
EqualsToken     // corresponds to =
8
NumericLiteral  // corresponds to 1
25
SemicolonToken  // corresponds to ;

Well, now we know that the scanner's function is to generate the token stream, and we know what a token stream looks like.

2.2 The parser

Yes, the parser generates the AST (abstract syntax tree) from the token stream! Let's start with a classic example of generating an AST:

import * as ts from 'ntypescript';

function printAllChildren(node: ts.Node, depth = 0) {
    console.log(new Array(depth + 1).join('----'), ts.formatSyntaxKind(node.kind), node.pos, node.end);
    depth++;
    node.getChildren().forEach(c => printAllChildren(c, depth));
}

var sourceCode = `const foo = 123;`;
var sourceFile = ts.createSourceFile('foo.ts', sourceCode, ts.ScriptTarget.ES5, true);
printAllChildren(sourceFile);

Output:

  SourceFile 0 16
  ---- SyntaxList 0 16
  -------- VariableStatement 0 16
  ------------ VariableDeclarationList 0 15
  ---------------- ConstKeyword 0 5
  ---------------- SyntaxList 5 15
  -------------------- VariableDeclaration 5 15
  ------------------------ Identifier 5 9
  ------------------------ EqualsToken 9 11
  ------------------------ NumericLiteral 11 15
  ------------ SemicolonToken 15 16
  ---- EndOfFileToken 16 16

An AST is essentially one large object. Each node in the tree represents a construct in the source code and carries information such as the node's kind and its start and end positions. So the parser implementation should look similar to the code above, and parser.ts indeed uses createSourceFile:

export function createSourceFile(fileName: string, sourceText: string, languageVersion: ScriptTarget, setParentNodes = false, scriptKind?: ScriptKind): SourceFile {
    performance.mark("beforeParse");
    const result = Parser.parseSourceFile(fileName, sourceText, languageVersion, /*syntaxCursor*/ undefined, setParentNodes, scriptKind);
    performance.mark("afterParse");
    performance.measure("Parse", "beforeParse", "afterParse");
    return result;
}

Obviously, performance.mark("beforeParse") and performance.mark("afterParse") mark the time before and after parsing, so the statement in between must be the actual parsing. The work ends up in parseSourceFileWorker:

function parseSourceFileWorker(fileName: string, languageVersion: ScriptTarget, setParentNodes: boolean, scriptKind: ScriptKind): SourceFile {
    // Create the target SourceFile for parsing
    sourceFile = createSourceFile(fileName, languageVersion, scriptKind);
    sourceFile.flags = contextFlags;

    // Prime the scanner.
    // Execute nextToken() to advance to the next scanned token
    nextToken();
    // Generate information for each token (including start and end positions)
    processReferenceComments(sourceFile);
    // Create the nodes (and node information) that make up the AST;
    // parseList repeatedly invokes parseStatement to parse each top-level statement
    sourceFile.statements = parseList(ParsingContext.SourceElements, parseStatement);
    Debug.assert(token() === SyntaxKind.EndOfFileToken);
    sourceFile.endOfFileToken = addJSDocComment(parseTokenNode() as EndOfFileToken);

    setExternalModuleIndicator(sourceFile);

    sourceFile.nodeCount = nodeCount;
    sourceFile.identifierCount = identifierCount;
    sourceFile.identifiers = identifiers;
    sourceFile.parseDiagnostics = parseDiagnostics;

    if (setParentNodes) {
        fixupParentReferences(sourceFile);
    }

    return sourceFile;
}

So the above function loops over the tokens of the parse target and creates nodes from them, thereby forming the AST.
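The idea of "looping over tokens and creating nodes" can be sketched with a toy recursive-descent parser. Everything here (the token shapes, node kinds, and the parseStatement helper) is hypothetical and far simpler than parser.ts, but the expect/consume pattern is the same:

```typescript
// Hypothetical token and AST node shapes; illustrative only, not parser.ts.
type Tok = { kind: string; text: string };
type AstNode = { kind: string; text?: string; children?: AstNode[] };

// Parses a token stream of the form: <keyword> <identifier> = <literal> ;
function parseStatement(tokens: Tok[]): AstNode {
    let i = 0;
    // Consume the next token if it has the expected kind, otherwise report a parse error.
    const expect = (kind: string): Tok => {
        if (!tokens[i] || tokens[i].kind !== kind) {
            throw new Error(`Expected ${kind}, got ${tokens[i] ? tokens[i].kind : "end of input"}`);
        }
        return tokens[i++];
    };
    const keyword = expect("keyword");
    const name = expect("identifier");
    expect("equals");
    const initializer = expect("literal");
    expect("semicolon");
    // Nest the recognized tokens into nodes, mirroring the tree printed above.
    return {
        kind: "VariableStatement",
        children: [
            { kind: "Keyword", text: keyword.text },
            {
                kind: "VariableDeclaration",
                children: [
                    { kind: "Identifier", text: name.text },
                    { kind: "NumericLiteral", text: initializer.text },
                ],
            },
        ],
    };
}

const ast = parseStatement([
    { kind: "keyword", text: "const" },
    { kind: "identifier", text: "foo" },
    { kind: "equals", text: "=" },
    { kind: "literal", text: "123" },
    { kind: "semicolon", text: ";" },
]);
console.log(ast.children![1].children![0].text); // foo
```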

2.3 The binder

The binder is used to create symbols. To assist type checking, the binder connects the scattered declarations in the source code into a coherent type system for the checker to use.

2.3.1 Symbols

A symbol connects declaration nodes in the AST with other declarations of the same entity. Symbols are the basic building blocks of the semantic system. So what does a symbol look like?

function Symbol(flags: SymbolFlags, name: string) {
    this.flags = flags;
    this.name = name;
    this.declarations = undefined;
}

SymbolFlags is a flag enum used to identify additional symbol categories (e.g. the variable scope flags FunctionScopedVariable and BlockScopedVariable). For details, see the SymbolFlags enum definition in compiler/types.ts. As you can see, a symbol is also just an object, containing flags, a name, and declarations.
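A tiny sketch of that shape shows why the flags are bitmasks: one symbol can carry several classifications at once, each tested with a bitwise AND. The flag values below are illustrative; the real ones live in the SymbolFlags enum in compiler/types.ts:

```typescript
// Sketch of the Symbol shape above, with illustrative flag values;
// the real flags are defined in the SymbolFlags enum in compiler/types.ts.
enum SymFlags {
    None = 0,
    FunctionScopedVariable = 1 << 0,
    BlockScopedVariable = 1 << 1,
    Function = 1 << 2,
}

class Sym {
    declarations: object[] = [];
    constructor(public flags: SymFlags, public name: string) {}
}

// One symbol can carry several classifications at once...
const s = new Sym(SymFlags.BlockScopedVariable | SymFlags.Function, "foo");
// ...and each classification is tested with a bitwise AND.
console.log((s.flags & SymFlags.BlockScopedVariable) !== 0); // true
```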

2.3.2 Creating symbols & binding nodes

bindSourceFile is the entry point of the binder:

export function bindSourceFile(file: SourceFile, options: CompilerOptions) {
    performance.mark("beforeBind");
    binder(file, options);
    performance.mark("afterBind");
    performance.measure("Bind", "beforeBind", "afterBind");
}

Does that sound familiar? Another Oreo! The performance.mark() calls mark the beginning and end of the binding, and the filling in between does the work. The per-node work happens in bind:

function bind(node: Node): void {
    if (!node) {
        return;
    }
    node.parent = parent;
    const saveInStrictMode = inStrictMode;

    // Even though in the AST the jsdoc @typedef node belongs to the current node,
    // its symbol might be in the same scope with the current node's symbol. Consider:
    //
    //     /** @typedef {string | number} MyType */
    //     function foo();
    //
    // Here the current node is "foo", which is a container, but the scope of "MyType" should
    // not be inside "foo". Therefore we always bind @typedef before bind the parent node,
    // and skip binding this tag later when binding all the other jsdoc tags.
    if (isInJavaScriptFile(node)) bindJSDocTypedefTagIfAny(node);

    // First we bind declaration nodes to a symbol if possible. We'll both create a symbol
    // and then potentially add the symbol to an appropriate symbol table. Possible
    // destination symbol tables are:
    //
    //  1) The 'exports' table of the current container's symbol.
    //  2) The 'members' table of the current container's symbol.
    //  3) The 'locals' table of the current container.
    //
    // However, not all symbols will end up in any of these tables. 'Anonymous' symbols
    // (like TypeLiterals for example) will not be put in any table.
    bindWorker(node);
    // Then we recurse into the children of the node to bind them as well. For certain
    // symbols we do specialized work when we recurse. For example, we'll keep track of
    // the current 'container' node when it changes. This helps us know which symbol table
    // a local should go into for example. Since terminal nodes are known not to have
    // children, as an optimization we don't process those.
    if (node.kind > SyntaxKind.LastToken) {
        const saveParent = parent;
        parent = node;
        const containerFlags = getContainerFlags(node);
        if (containerFlags === ContainerFlags.None) {
            bindChildren(node);
        }
        else {
            bindContainer(node, containerFlags);
        }
        parent = saveParent;
    }
    else if (!skipTransformFlagAggregation && (node.transformFlags & TransformFlags.HasComputedFlags) === 0) {
        subtreeTransformFlags |= computeTransformFlagsForNode(node, 0);
    }
    inStrictMode = saveInStrictMode;
}

There are a lot of comments in the source, and reading them gives most of the meaning. The first thing bind does is assign node.parent (if the parent variable has been set; the binder will set it again while processing inside bindChildren), then it hands the node to bindWorker, which calls the binding function appropriate to each node kind. Finally it recurses via bindChildren (this function stores binder state such as parent into function-local variables, calls bind on each child node, and then restores the state back into the binder). So the focus is the bindWorker function.

function bindWorker(node: Node) {
    switch (node.kind) {
        case SyntaxKind.Identifier:
            if ((<Identifier>node).isInJSDocNamespace) {
                let parentNode = node.parent;
                while (parentNode && parentNode.kind !== SyntaxKind.JSDocTypedefTag) {
                    parentNode = parentNode.parent;
                }
                bindBlockScopedDeclaration(<Declaration>parentNode, SymbolFlags.TypeAlias, SymbolFlags.TypeAliasExcludes);
                break;
            }
        // ...
    }
}

This function switches on node.kind (a SyntaxKind) and delegates the work to the appropriate bindXXX function (also defined in binder.ts). For example, if the node is an Identifier, bindBlockScopedDeclaration is invoked. The bindXXX family of functions shares some general patterns and utility functions. One of the most commonly used is createSymbol:

function createSymbol(flags: SymbolFlags, name: string): Symbol {
    symbolCount++;
    return new Symbol(flags, name);
}

This function increments symbolCount and creates a symbol with the given parameters. After the symbol is created, the node must be bound to it, establishing the link between node and symbol:

function addDeclarationToSymbol(symbol: Symbol, node: Declaration, symbolFlags: SymbolFlags) {
    symbol.flags |= symbolFlags;
    // Create a connection between the AST node and the symbol
    node.symbol = symbol;
    if (!symbol.declarations) {
        symbol.declarations = [];
    }
    // Add the node as a declaration of the symbol
    symbol.declarations.push(node);
    // ...
}

The above code performs the following operations:

  • Creates a link from the AST node to the symbol (node.symbol)
  • Adds the node as one of the symbol's declarations

At this point, the process "source code -> scanner -> token stream -> parser -> AST -> binder -> symbols" is complete. What remains is type checking and code emit.
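The create-and-link dance above can be condensed into a toy binder. The shapes below are invented for illustration, but the two links (node.symbol and symbol.declarations) mirror addDeclarationToSymbol, and reusing an existing symbol for a repeated name is how declaration merging (e.g. of interfaces) naturally falls out:

```typescript
// Toy binder sketch with invented shapes; the two links mirror addDeclarationToSymbol.
type Decl = { kind: string; name: string; symbol?: BoundSymbol };
type BoundSymbol = { name: string; declarations: Decl[] };

function bindDeclarations(decls: Decl[]): Map<string, BoundSymbol> {
    const symbolTable = new Map<string, BoundSymbol>();
    for (const node of decls) {
        // Reuse the existing symbol for a repeated name (declaration merging),
        // otherwise create a fresh one, as createSymbol would.
        let symbol = symbolTable.get(node.name);
        if (!symbol) {
            symbol = { name: node.name, declarations: [] };
            symbolTable.set(node.name, symbol);
        }
        node.symbol = symbol;            // link: AST node -> symbol
        symbol.declarations.push(node);  // link: symbol -> declaring node
    }
    return symbolTable;
}

const file: Decl[] = [
    { kind: "InterfaceDeclaration", name: "Foo" },
    { kind: "InterfaceDeclaration", name: "Foo" }, // second declaration of the same entity
];
const table = bindDeclarations(file);
console.log(table.get("Foo")!.declarations.length); // 2
```

Both declarations of `Foo` end up attached to one symbol, which is exactly the "same entity" relationship the real binder records for the checker.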

2.4 The checker

The checker is located in checker.ts and currently has more than 23K lines of code (the largest part of the compiler), so it is only outlined here. The checker is initialized by the program. Here is the call stack:

program.getTypeChecker ->
    ts.createTypeChecker ->
        initializeTypeChecker ->
            for each SourceFile: `ts.bindSourceFile` (binder)
            // then
            for each SourceFile: `ts.mergeSymbolTable` (checker)

We can see that initializeTypeChecker calls both the binder's bindSourceFile and the checker's own mergeSymbolTable. As mentioned in the previous section, the job of bindSourceFile is ultimately to create a symbol for each node, linking the nodes into a connected type system. So what does mergeSymbolTable do? It merges all global symbols into the `let globals: SymbolTable = {}` symbol table; all subsequent type checks can then be resolved against globals. The real type checking happens when getDiagnostics is called:

function getDiagnostics(sourceFile: SourceFile, ct: CancellationToken): Diagnostic[] {
    try {
        cancellationToken = ct;
        return getDiagnosticsWorker(sourceFile);
    }
    finally {
        cancellationToken = undefined;
    }
}

getDiagnosticsWorker in turn calls a large number of functions, too many to expand here; only the general process is sketched. (Again, if you want to understand the concrete mechanics, you should read the code yourself; everything summarized here is based on the source!) The checker in a nutshell: based on the declaration nodes of the generated AST, their positions in the source string, and the symbols bound to them, it checks positions, types, syntax, and so on, and throws diagnostics for violations. This completes "AST + symbols -> checker -> type checking".
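As a heavily simplified illustration of "checking against globals" (nothing like the real 23K-line checker), here is a hypothetical check that resolves used names against a merged global symbol table and reports a diagnostic for any name that was never bound:

```typescript
// Hypothetical mini-check against a merged "globals" table; illustrative only.
type Diag = { message: string };
type GlobalSymbol = { name: string };

function getDiagnosticsFor(usedNames: string[], globals: Map<string, GlobalSymbol>): Diag[] {
    const diagnostics: Diag[] = [];
    for (const name of usedNames) {
        // A name with no symbol in globals was never bound anywhere: report it.
        if (!globals.has(name)) {
            diagnostics.push({ message: `Cannot find name '${name}'.` });
        }
    }
    return diagnostics;
}

const globals = new Map<string, GlobalSymbol>([["foo", { name: "foo" }]]);
const diags = getDiagnosticsFor(["foo", "bar"], globals);
console.log(diags.length, diags[0].message); // 1 Cannot find name 'bar'.
```

The message format echoes the real TS2304 diagnostic, but the resolution logic here is of course a caricature of what the checker actually does.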

2.5 The emitter

The TypeScript compiler provides two emitters:

  • emitter.ts: the TS -> JavaScript emitter
  • declarationEmitter.ts: the emitter used to create declaration files (.d.ts) from TypeScript source files (.ts)

Program provides an emit function, which mainly delegates the work to emitFiles in emitter.ts. Here is the call stack:

Program.emit ->
    `emitWorker` (created by createProgram in program.ts) ->
        `emitFiles` (function in emitter.ts)

emitWorker (through an emitFiles parameter) provides an EmitResolver to the emitter. The EmitResolver is provided by the program's TypeChecker and is basically a collection of local functions from createTypeChecker. The emitter emits different code depending on the hint:

function pipelineEmitWithHint(hint: EmitHint, node: Node): void {
    switch (hint) {
        case EmitHint.SourceFile: return pipelineEmitSourceFile(node);
        case EmitHint.IdentifierName: return pipelineEmitIdentifierName(node);
        case EmitHint.Expression: return pipelineEmitExpression(node);
        case EmitHint.Unspecified: return pipelineEmitUnspecified(node);
    }
}

This completes the "checker + emitter -> JS" process.
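The dispatch style above can be miniaturized into a toy emitter. The node shapes are invented, but the sketch shows the two essential points: emit is a switch on node kind, and type annotations are simply erased from the output:

```typescript
// Toy emitter sketch; illustrative only, NOT emitter.ts.
type EmitNode =
    | { kind: "NumericLiteral"; text: string }
    | { kind: "VariableStatement"; keyword: string; name: string; typeAnnotation?: string; initializer: EmitNode };

function emit(node: EmitNode): string {
    switch (node.kind) {
        case "NumericLiteral":
            return node.text;
        case "VariableStatement":
            // The type annotation exists only for the checker; it is erased on emit.
            return `${node.keyword} ${node.name} = ${emit(node.initializer)};`;
    }
}

console.log(emit({
    kind: "VariableStatement",
    keyword: "const",
    name: "foo",
    typeAnnotation: "number",
    initializer: { kind: "NumericLiteral", text: "123" },
}));
// const foo = 123;
```

Note that the typeAnnotation field never appears in the output string, which is the essence of "compiling TS to JS": types are checked, then thrown away.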

III. Final words

Reference articles:

  • Understanding TypeScript in depth
  • A deeper understanding of TypeScript from the compiler
  • The TS compiler source code

Thank you!