1. Introduction

In the last article, we introduced @babel/core. This time, we will analyze @babel/parser.

@babel/parser can be divided into two parts: lexical analysis and syntax analysis. Let's see how they work and how they are combined.

2. API description

Previous versions of Babel called the Recast library directly to generate the AST. Now @babel/parser borrows a lot from Acorn.

Before looking at the source code, you can read the documentation first to get a basic understanding of its features and APIs.

@babel/parser mainly exposes two APIs, parse and parseExpression, in the following form:

```js
babelParser.parse(code, [options])
```
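For example (a quick sketch; the values in the comments are what each call returns):

```js
const babelParser = require("@babel/parser");

// parse() returns a File node whose program property holds the Program node.
const file = babelParser.parse("const a = 1");
console.log(file.type);                 // "File"
console.log(file.program.body[0].type); // "VariableDeclaration"

// parseExpression() parses a single expression and returns it directly,
// without the File/Program wrappers.
const expr = babelParser.parseExpression("a + 1");
console.log(expr.type); // "BinaryExpression"
```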

Note the sourceType parameter in options. Its value can be script, module, or unambiguous, and the default is script.

When sourceType is unambiguous, the parser decides based on whether the import/export keyword is present: if import or export appears, sourceType is updated to module; otherwise it is treated as script.
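A small sketch of how unambiguous is resolved:

```js
const { parse } = require("@babel/parser");

// Contains an import, so sourceType resolves to "module".
const withImport = parse("import a from 'a'", { sourceType: "unambiguous" });
console.log(withImport.program.sourceType); // "module"

// No import/export, so it falls back to "script".
const plain = parse("const a = 1", { sourceType: "unambiguous" });
console.log(plain.program.sourceType); // "script"
```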

3. Source code analysis

The @babel/parser version analyzed here is v7.16.4.

1. Directory structure

Take a look at its directory structure:

```
- parser
  - base
  - comment
  - error
  - expression
  - index
  - lval
  - node
  - statement
- plugins
  - flow
  - jsx
  - typescript
- tokenizer
  - context
  - index
  - state
- utils
  - identifier
  - location
  - scope
- index
```

You can see that it consists of three main parts: tokenizer, plugins, and parser. Among them:

  1. `tokenizer` is used to parse `token`s and includes `state` and `context`;
  2. `parser` contains `node`, `statement`, `expression`, and so on;
  3. `plugins` helps parse syntax such as `ts` and `jsx` (see the usage sketch below).
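As a quick usage sketch, enabling the jsx and typescript plugins lets the parser accept TSX-style code:

```js
const { parse } = require("@babel/parser");

const ast = parse("const el: JSX.Element = <div />;", {
  sourceType: "module",
  plugins: ["jsx", "typescript"], // syntax plugins extend the base Parser class
});
console.log(ast.program.body[0].type); // "VariableDeclaration"
```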

Looking at index.js, when the exported parse method is called, a Parser is instantiated — the Parser from the parser/index.js file above — and its parse method is invoked. What is interesting about the Parser implementation is its long inheritance chain.

Why organize the code this way? The advantage of this inheritance chain is that any method of the other classes can be reached through this, and shared data, or state, can be maintained in one place: Node nodes, state, context, scope, and so on.
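Roughly, the chain looks like this (pieced together from the directory above and the class declarations quoted later in this article, so treat the exact names and order as an approximation):

```js
// Approximate inheritance chain, subclass first:
// Parser -> StatementParser -> ExpressionParser -> LValParser
//   -> NodeUtils -> UtilParser -> Tokenizer -> ParserErrors
//   -> CommentsParser -> BaseParser
```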

2. Operation mechanism

Here's a very simple example of how the parser works:

```js
const { parse } = require('@babel/parser')

parse('const a = 1', {})
```

When calling the @babel/parser parse method, it instantiates a Parser using getParser and then calls the parse method on the parser.

```js
function getParser(options: ?Options, input: string): Parser {
  let cls = Parser;
  if (options?.plugins) {
    validatePlugins(options.plugins);
    cls = getParserClass(options.plugins);
  }

  return new cls(options, input);
}
```

Due to the inheritance of Parser, a series of initialization operations, including state, scope, and context, are performed in each class during instantiation.

We then enter the parse method, where the main logic consists of:

  1. Enter the initial scopes;
  2. Create the `File` and `Program` nodes via `startNode`;
  3. Get one `token` via `nextToken`;
  4. Recurse and parse via `parseTopLevel`.

```js
export default class Parser extends StatementParser {
  parse(): File {
    this.enterInitialScopes();
    const file = this.startNode();
    const program = this.startNode();
    this.nextToken();
    file.errors = null;
    this.parseTopLevel(file, program);
    file.errors = this.state.errors;
    return file;
  }
}
```

startNode, a method on the NodeUtils class, creates a new Node. The top node of the AST is of type File, and it has a program property, so another node needs to be created; the two nodes share the same start and loc information.

```js
export class NodeUtils extends UtilParser {
  startNode<T: NodeType>(): T {
    return new Node(this, this.state.start, this.state.startLoc);
  }
}
```

In the Tokenizer, the main logic of nextToken is to check whether the current position has reached the end of the input. If it has, finishToken is called to emit the final eof token; otherwise getTokenFromCode or readTmplToken is called to read a token from the code.

readTmplToken is used when the current context is ct.template, i.e. inside a template literal started with a backtick; in our case, const a = 1 goes to getTokenFromCode.

this.codePointAtPos(this.state.pos) gets the code point of the current character, in our case c.

```js
export default class Tokenizer extends ParserErrors {
  nextToken(): void {
    const curContext = this.curContext();
    if (!curContext.preserveSpace) this.skipSpace();
    this.state.start = this.state.pos;
    if (!this.isLookahead) this.state.startLoc = this.state.curPosition();
    if (this.state.pos >= this.length) {
      this.finishToken(tt.eof);
      return;
    }

    if (curContext === ct.template) {
      this.readTmplToken();
    } else {
      this.getTokenFromCode(this.codePointAtPos(this.state.pos));
    }
  }
}
```
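For reference, codePointAtPos behaves like String.prototype.codePointAt for our input, so the values it sees are:

```js
// Code points at a few positions in "const a = 1":
"const a = 1".codePointAt(0);  // 99 -> "c" (start of the `const` keyword)
"const a = 1".codePointAt(5);  // 32 -> " " (skipped by skipSpace)
"const a = 1".codePointAt(10); // 49 -> "1" (read later by readNumber)
```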

getTokenFromCode switches on the incoming character code and handles dots, parentheses, commas, numbers, and so on. In our case it falls through to default and calls isIdentifierStart to determine whether the character can start an identifier. If so, readWord is called to read a complete token.

```js
getTokenFromCode(code: number): void {
  switch (code) {
    // The interpretation of a dot depends on whether it is followed
    // by a digit or another two dots.
    case charCodes.dot:
      this.readToken_dot();
      return;

    // Punctuation tokens.
    case charCodes.leftParenthesis:
      ++this.state.pos;
      this.finishToken(tt.parenL);
      return;
    case charCodes.rightParenthesis:
      ++this.state.pos;
      this.finishToken(tt.parenR);
      return;
    case charCodes.semicolon:
      ++this.state.pos;
      this.finishToken(tt.semi);
      return;
    case charCodes.comma:
      ++this.state.pos;
      this.finishToken(tt.comma);
      return;
    // ...

    default:
      if (isIdentifierStart(code)) {
        this.readWord(code);
        return;
      }
  }
}
```

The core of isIdentifierStart is to check the character's code point:

  • if `code` is below uppercase `A` (65), it returns `true` only if `code` is the `$` sign;
  • if `code` is between `A` and `Z` (65-90), it returns `true`;
  • if `code` is between uppercase `Z` and lowercase `a` (91-96), it returns `true` only if `code` is the underscore `_`;
  • if `code` is between `a` and `z` (97-122), it returns `true`;
  • if `code` is at most 65535 and at least 170, it returns `true` when it matches the non-ASCII identifier-start pattern;
  • if `code` is greater than 65535, `isInAstralSet` checks whether it is a valid identifier start.
```js
export function isIdentifierStart(code: number): boolean {
  if (code < charCodes.uppercaseA) return code === charCodes.dollarSign;
  if (code <= charCodes.uppercaseZ) return true;
  if (code < charCodes.lowercaseA) return code === charCodes.underscore;
  if (code <= charCodes.lowercaseZ) return true;
  if (code <= 0xffff) {
    return (
      code >= 0xaa && nonASCIIidentifierStart.test(String.fromCharCode(code))
    );
  }
  return isInAstralSet(code, astralIdentifierStartCodes);
}
```
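Applying these rules to the characters in our example (isIdentifierStart comes from @babel/helper-validator-identifier, so you can try it directly):

```js
const { isIdentifierStart } = require("@babel/helper-validator-identifier");

isIdentifierStart("c".codePointAt(0)); // true  - 99 falls in the a-z range (97-122)
isIdentifierStart("$".codePointAt(0)); // true  - 36, the dollar-sign special case
isIdentifierStart("1".codePointAt(0)); // false - 49 is below "A" (65) and is not "$"
```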

Here our c can start an identifier, so we enter readWord.

readWord, also in the Tokenizer, is used to read a complete keyword or identifier.

It first calls readWord1, which uses isIdentifierChar to check whether the current character can be part of an identifier; as long as it can, the position pos is incremented by one, moving forward.

In our example, the loop breaks at the space after const, leaving state.pos at 5, so word is const.

readWord then determines the type of word, i.e. whether it is a keyword. keywordTypes is a Map whose keys are the various keywords and symbols and whose values are numbers; const is a keyword, and its value is 67, so tokenLabelName is then called.

tokenLabelName gets the token label based on the type; for const the token type is _const.

If word is not a keyword, readWord treats it as a user-defined identifier, such as a later in this example.

```js
// Read an identifier or keyword token. Will check for reserved
// words when necessary.
readWord(firstCode: number | void): void {
  const word = this.readWord1(firstCode);
  const type = keywordTypes.get(word);
  if (type !== undefined) {
    // We don't use word as state.value here because word is a dynamic string
    // while token label is a shared constant string
    this.finishToken(type, tokenLabelName(type));
  } else {
    this.finishToken(tt.name, word);
  }
}

readWord1(firstCode: number | void): string {
  this.state.containsEsc = false;
  let word = "";
  const start = this.state.pos;
  let chunkStart = this.state.pos;
  if (firstCode !== undefined) {
    this.state.pos += firstCode <= 0xffff ? 1 : 2;
  }

  while (this.state.pos < this.length) {
    const ch = this.codePointAtPos(this.state.pos);
    if (isIdentifierChar(ch)) {
      this.state.pos += ch <= 0xffff ? 1 : 2;
    } else if (ch === charCodes.backslash) {
      this.state.containsEsc = true;

      word += this.input.slice(chunkStart, this.state.pos);
      const escStart = this.state.pos;
      const identifierCheck =
        this.state.pos === start ? isIdentifierStart : isIdentifierChar;

      if (this.input.charCodeAt(++this.state.pos) !== charCodes.lowercaseU) {
        this.raise(this.state.pos, Errors.MissingUnicodeEscape);
        chunkStart = this.state.pos - 1;
        continue;
      }

      ++this.state.pos;
      const esc = this.readCodePoint(true);
      if (esc !== null) {
        if (!identifierCheck(esc)) {
          this.raise(escStart, Errors.EscapedCharNotAnIdentifier);
        }

        word += String.fromCodePoint(esc);
      }
      chunkStart = this.state.pos;
    } else {
      break;
    }
  }

  return word + this.input.slice(chunkStart, this.state.pos);
}
```

Then we reach finishToken.

finishToken sets the end, type, and value on the state: a complete token has now been read, so its end position, type, and actual value are all known.

```js
finishToken(type: TokenType, val: any): void {
  this.state.end = this.state.pos;
  const prevType = this.state.type;
  this.state.type = type;
  this.state.value = val;

  if (!this.isLookahead) {
    this.state.endLoc = this.state.curPosition();
    this.updateContext(prevType);
  }
}
```
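If you want to observe the full token stream the Tokenizer produces for this input, the tokens: true option attaches it to the File node (a quick sketch; the exact token shape can differ between versions):

```js
const { parse } = require("@babel/parser");

const file = parse("const a = 1", { tokens: true });
for (const token of file.tokens) {
  console.log(token.start, token.end, token.value);
}
// Roughly: const (0-5), a (6-7), = (8-9), 1 (10-11), plus a final eof token
```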

Back in the parse method of the Parser, the call chain is:

  1. parseTopLevel =>
  2. parseProgram =>
  3. parseBlockBody =>
  4. parseBlockOrModuleBlockBody =>
  5. parseStatement =>
  6. parseStatementContent, which actually parses the statement.
```js
parseBlockOrModuleBlockBody(
  body: N.Statement[],
  directives: ?(N.Directive[]),
  topLevel: boolean,
  end: TokenType,
  afterBlockParse?: (hasStrictModeDirective: boolean) => void,
): void {
  const oldStrict = this.state.strict;
  let hasStrictModeDirective = false;
  let parsedNonDirective = false;

  while (!this.match(end)) {
    const stmt = this.parseStatement(null, topLevel);

    if (directives && !parsedNonDirective) {
      if (this.isValidDirective(stmt)) {
        const directive = this.stmtToDirective(stmt);
        directives.push(directive);

        if (
          !hasStrictModeDirective &&
          directive.value.value === "use strict"
        ) {
          hasStrictModeDirective = true;
          this.setStrict(true);
        }

        continue;
      }
      parsedNonDirective = true;
      // clear strict errors since the strict mode will not change within the block
      this.state.strictErrors.clear();
    }
    body.push(stmt);
  }

  if (afterBlockParse) {
    afterBlockParse.call(this, hasStrictModeDirective);
  }

  if (!oldStrict) {
    this.setStrict(false);
  }

  this.next();
}
```

parseStatementContent parses different statements based on starttype, such as function/do/while/break; in our case it is const, so we enter parseVarStatement.

```js
export default class StatementParser extends ExpressionParser {
  parseStatement(context: ?string, topLevel?: boolean): N.Statement {
    if (this.match(tt.at)) {
      this.parseDecorators(true);
    }
    return this.parseStatementContent(context, topLevel);
  }

  parseStatementContent(context: ?string, topLevel: ?boolean): N.Statement {
    let starttype = this.state.type;
    const node = this.startNode();
    let kind;

    if (this.isLet(context)) {
      starttype = tt._var;
      kind = "let";
    }

    // Most types of statements are recognized by the keyword they
    // start with. Many are trivial to parse, some require a bit of
    // complexity.

    switch (starttype) {
      case tt._break:
        return this.parseBreakContinueStatement(node, /* isBreak */ true);
      case tt._continue:
        return this.parseBreakContinueStatement(node, /* isBreak */ false);
      case tt._debugger:
        return this.parseDebuggerStatement(node);
      case tt._do:
        return this.parseDoStatement(node);
      case tt._for:
        return this.parseForStatement(node);
      case tt._function:
        // ...
      case tt._const:
      case tt._var:
        kind = kind || this.state.value;
        if (context && kind !== "var") {
          this.raise(this.state.start, Errors.UnexpectedLexicalDeclaration);
        }
        return this.parseVarStatement(node, kind);
      // ...
    }
  }

  parseVarStatement(
    node: N.VariableDeclaration,
    kind: "var" | "let" | "const",
  ): N.VariableDeclaration {
    this.next();
    this.parseVar(node, false, kind);
    this.semicolon();
    return this.finishNode(node, "VariableDeclaration");
  }
}
```

parseVarStatement first calls next => nextToken to parse the next token, in this case a.

```js
next(): void {
  this.checkKeywordEscapes();
  if (this.options.tokens) {
    this.pushToken(new Token(this.state));
  }

  this.state.lastTokEnd = this.state.end;
  this.state.lastTokStart = this.state.start;
  this.state.lastTokEndLoc = this.state.endLoc;
  this.state.lastTokStartLoc = this.state.startLoc;
  this.nextToken();
}
```

parseVarStatement then goes into parseVar, which creates a new node for the variable declarator and then calls eat.

```js
parseVar(
  node: N.VariableDeclaration,
  isFor: boolean,
  kind: "var" | "let" | "const",
): N.VariableDeclaration {
  const declarations = (node.declarations = []);
  const isTypescript = this.hasPlugin("typescript");
  node.kind = kind;
  for (;;) {
    const decl = this.startNode();
    this.parseVarId(decl, kind);
    if (this.eat(tt.eq)) {
      decl.init = isFor
        ? this.parseMaybeAssignDisallowIn()
        : this.parseMaybeAssignAllowIn();
    } else {
      if (
        kind === "const" &&
        !(this.match(tt._in) || this.isContextual(tt._of))
      ) {
        // `const` with no initializer is allowed in TypeScript.
        // It could be a declaration like `const x: number;`.
        if (!isTypescript) {
          this.raise(
            this.state.lastTokEnd,
            Errors.DeclarationMissingInitializer,
            "Const declarations",
          );
        }
      } else if (
        decl.id.type !== "Identifier" &&
        !(isFor && (this.match(tt._in) || this.isContextual(tt._of)))
      ) {
        this.raise(
          this.state.lastTokEnd,
          Errors.DeclarationMissingInitializer,
          "Complex binding patterns",
        );
      }
      decl.init = null;
    }
    declarations.push(this.finishNode(decl, "VariableDeclarator"));
    if (!this.eat(tt.comma)) break;
  }
  return node;
}
```

The eat method lives in the Tokenizer. Here the current token is =, so it matches; next => nextToken => getTokenFromCode => readNumber parses the next token, which is 1, and eat returns true, so we enter parseMaybeAssignAllowIn (isFor is false in our case).

parseMaybeAssignAllowIn is not expanded here; in the end it produces a Node of type NumericLiteral, which is assigned to decl.init.

Then declarations.push(this.finishNode(decl, "VariableDeclarator")) pushes the finished declarator into the declarations array.

```js
eat(type: TokenType): boolean {
  if (this.match(type)) {
    this.next();
    return true;
  } else {
    return false;
  }
}
```

Back in parseVarStatement, this.finishNode(node, "VariableDeclaration") is finally called to finish parsing the node.

Back in parseBlockOrModuleBlockBody, the parsed stmt is pushed into body, and next => nextToken is called again; at this point state.pos is 11 and this.length is 11, so finishToken(tt.eof) is called.

Back in parseProgram, finishNode(program, "Program") is called to finish the Program node.

Back in parseTopLevel, this.finishNode(file, "File") is called to finish the File node.

```js
nextToken(): void {
  const curContext = this.curContext();
  if (!curContext.preserveSpace) this.skipSpace();
  this.state.start = this.state.pos;
  if (!this.isLookahead) this.state.startLoc = this.state.curPosition();
  if (this.state.pos >= this.length) {
    this.finishToken(tt.eof);
    return;
  }

  if (curContext === ct.template) {
    this.readTmplToken();
  } else {
    this.getTokenFromCode(this.codePointAtPos(this.state.pos));
  }
}
```

To sum up, the process of parsing const a = 1 with @babel/parser is:

  1. Create a new `Node` of type `File`, temporarily named `NodeA`;
  2. Create a new `Node` of type `Program`, named `NodeB`;
  3. Parse a `token` and get `const`;
  4. Enter the parser, go into `parseVarStatement`, and create a new `VariableDeclaration` `Node`, named `NodeC`;
  5. Call `next` to get one more `token`, `a`;
  6. Enter `parseVar` and create a new `VariableDeclarator` `Node`, named `NodeD`;
  7. Call `eat`; since the next `token` is `=`, it matches, and `next` is called again to parse the next `token`, `1`;
  8. Enter `parseMaybeAssignAllowIn` and get a `Node` of type `NumericLiteral`, named `NodeE`;
  9. Assign `NodeE` to the `init` property of `NodeD`;
  10. Push `NodeD` into the `declarations` array of `NodeC`;
  11. Push `NodeC` into the `body` array of `NodeB`;
  12. Assign `NodeB` to the `program` property of `NodeA`;
  13. Return `NodeA`, the `File`-typed `Node`.

The AST for the simple example of const a = 1 (omitting loc/errors/comments, etc.) is posted below. You can also check it out in the AST Explorer.

```json
{
  "type": "File",
  "start": 0,
  "end": 11,
  "program": {
    "type": "Program",
    "start": 0,
    "end": 11,
    "sourceType": "module",
    "interpreter": null,
    "body": [
      {
        "type": "VariableDeclaration",
        "start": 0,
        "end": 11,
        "declarations": [
          {
            "type": "VariableDeclarator",
            "start": 6,
            "end": 11,
            "id": {
              "type": "Identifier",
              "start": 6,
              "end": 7,
              "name": "a"
            },
            "init": {
              "type": "NumericLiteral",
              "start": 10,
              "end": 11,
              "extra": {
                "rawValue": 1,
                "raw": "1"
              },
              "value": 1
            }
          }
        ],
        "kind": "const"
      }
    ]
  }
}
```

3. Flow chart

Below is a flow chart for further understanding.
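In text form, the flow walked through above looks roughly like this:

```js
// parse()
//   enterInitialScopes()
//   startNode()  -> File node
//   startNode()  -> Program node
//   nextToken()  -> token `const`
//   parseTopLevel()
//     parseProgram() -> parseBlockBody() -> parseBlockOrModuleBlockBody()
//       parseStatement() -> parseStatementContent()
//         parseVarStatement()           -> VariableDeclaration node
//           next()/nextToken()          -> token `a`
//           parseVar()
//             parseVarId()              -> Identifier node
//             eat(tt.eq) -> next()      -> token `1`
//             parseMaybeAssignAllowIn() -> NumericLiteral node
//   finishNode() closes VariableDeclarator, VariableDeclaration, Program, and File
```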

5. Summary

From the above example, it can be seen that:

  1. Lexical analysis and syntax analysis are not as cleanly separated as many articles describe, i.e. first producing all the `token`s and then parsing the `token`s into an AST. Instead, syntax analysis starts as soon as the first `token` is obtained, so `Node`s and `Token`s are maintained at the same time, and the next `Token` is fetched again and again via the `nextToken` method.
  2. `startNode`/`finishNode` and `startToken`/`finishToken` come in pairs and can be understood as a stack, or onion, model. A `Node` that has been entered will eventually be closed by `finishNode`, which assigns its `type`/`end` and other attributes; the same goes for `token`s.

This article is only meant as a starting point. If you are interested in @babel/parser, read the source code in depth, and if you have any questions, feel free to discuss them.

6. Series of articles

  1. Babel basics
  2. @babel/core
  3. Babel source parser @babel/parser
  4. Babel traverse @babel/traverse
  5. @babel/generator

7. Related materials

  1. the-super-tiny-compiler
  2. babel-handbook
  3. loose-mode