I. Introduction
In the last article, we introduced @babel/core. This time, we will analyze @babel/parser.
The work of @babel/parser can be divided into two parts: lexical analysis and syntax analysis. Let's see how they work and how they are combined.
II. API description
Previous versions of Babel called the Recast library directly to generate the AST; the current @babel/parser borrows heavily from Acorn.
Before looking at the source code, you can read the documentation first to get a basic understanding of its functions and APIs.
The @babel/parser API mainly exposes parse and parseExpression, in the form of:
babelParser.parse(code, [options])
babelParser.parseExpression(code, [options])
Note the sourceType option: its value can be "script", "module", or "unambiguous", and the default is "script".
With "unambiguous", the sourceType is decided by the presence of import/export: if import/export appears, sourceType is updated to "module", otherwise it stays "script".
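For example, a quick sketch of the unambiguous behaviour (the detected type ends up on program.sourceType):

const { parse } = require('@babel/parser');

// With "unambiguous", the presence of import/export decides the type.
const withImport = parse("import a from 'a'", { sourceType: 'unambiguous' });
withImport.program.sourceType; // "module"

const plainScript = parse('const a = 1', { sourceType: 'unambiguous' });
plainScript.program.sourceType; // "script"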
III. Source code analysis
The @babel/parser version analyzed here is v7.16.4.
1. Directory structure
Take a look at its directory structure:
- parser
  - base
  - comment
  - error
  - expression
  - index
  - lval
  - node
  - statement
- plugins
  - flow
  - jsx
  - typescript
- tokenizer
  - context
  - index
  - state
- utils
  - identifier
  - location
  - scope
- index
You can see that it consists of three main parts: tokenizer, parser, and plugins. The tokenizer is responsible for parsing tokens and includes state and context; the parser contains node, statement, expression, and so on; plugins is used to help parse syntax such as TypeScript, Flow, and JSX.
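These syntax plugins are enabled through the plugins option of parse, for example (a minimal sketch):

const { parse } = require('@babel/parser');

// The typescript and jsx plugins let the parser accept TS and JSX syntax.
const ast = parse('const el: JSX.Element = <div />;', {
  sourceType: 'module',
  plugins: ['typescript', 'jsx'],
});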
Looking at index.js, when the parser.parse method is called, a Parser is instantiated; this is the Parser from the parser/index.js file above. What is interesting about the Parser implementation is its long inheritance chain.
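Reconstructed from the class declarations quoted throughout this article and the directory above (so treat it as a sketch rather than an authoritative listing), the chain looks roughly like this:

// Parser -> StatementParser -> ExpressionParser -> LValParser -> NodeUtils
//        -> UtilParser -> Tokenizer -> ParserErrors -> CommentsParser -> BaseParser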
Why organize the code this way? The advantage of this inheritance chain is that any class can reach the methods of the other classes simply through this, and they can all share common data such as Node nodes, state, context, scope, and so on.
2. Operation mechanism
Here is a very simple example of how the parser works:

const { parse } = require('@babel/parser')
parse('const a = 1', {})
When the parse method of @babel/parser is called, it instantiates a Parser via getParser and then calls the parse method on that instance.
function getParser(options: ?Options, input: string): Parser {
  let cls = Parser;
  if (options?.plugins) {
    validatePlugins(options.plugins);
    cls = getParserClass(options.plugins);
  }
  return new cls(options, input);
}
Because of the Parser inheritance chain, instantiating it runs a series of initialization steps in each class along the chain, including state, scope, and context.
We then enter the parse method, whose main logic is to:
- enter the initial scopes;
- create the File node and the Program node via startNode;
- get a token via nextToken;
- recurse and parse via parseTopLevel.
export default class Parser extends StatementParser {
  parse(): File {
    this.enterInitialScopes();
    const file = this.startNode();
    const program = this.startNode();
    this.nextToken();
    file.errors = null;
    this.parseTopLevel(file, program);
    file.errors = this.state.errors;
    return file;
  }
}
startNode, defined in the NodeUtils class, creates a new Node. The top-level node of the AST is of type File and has a program property, so another node needs to be created for Program; the two nodes share the same start and loc information.
export class NodeUtils extends UtilParser {
  startNode<T: NodeType>(): T {
    return new Node(this, this.state.start, this.state.startLoc);
  }
}
In the Tokenizer, nextToken skips whitespace (unless the context preserves it) and checks whether the current position has reached the end of the input. If so, it calls finishToken(tt.eof) to emit the end-of-file token; otherwise it calls getTokenFromCode or readTmplToken to read a token from the code.
The ct.template check is for template literals, which start with a backtick; in our case const a = 1 goes to getTokenFromCode.
this.codePointAtPos(this.state.pos) gets the code point of the current character, in our case c.
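codePointAtPos behaves essentially like String.prototype.codePointAt on the input string, so for our example:

'const a = 1'.codePointAt(0); // 99, the code point of "c"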
export default class Tokenizer extends ParserErrors {
  nextToken(): void {
    const curContext = this.curContext();
    if (!curContext.preserveSpace) this.skipSpace();
    this.state.start = this.state.pos;
    if (!this.isLookahead) this.state.startLoc = this.state.curPosition();
    if (this.state.pos >= this.length) {
      this.finishToken(tt.eof);
      return;
    }
    if (curContext === ct.template) {
      this.readTmplToken();
    } else {
      this.getTokenFromCode(this.codePointAtPos(this.state.pos));
    }
  }
}
getTokenFromCode switches on the incoming code point: dots, parentheses, commas, digits, and so on. In our case it falls through to default and calls isIdentifierStart to determine whether the character can start an identifier; if so, it calls readWord to read a complete token.
getTokenFromCode(code: number): void {
  switch (code) {
    // The interpretation of a dot depends on whether it is followed
    // by a digit or another two dots.
    case charCodes.dot:
      this.readToken_dot();
      return;

    // Punctuation tokens.
    case charCodes.leftParenthesis:
      ++this.state.pos;
      this.finishToken(tt.parenL);
      return;
    case charCodes.rightParenthesis:
      ++this.state.pos;
      this.finishToken(tt.parenR);
      return;
    case charCodes.semicolon:
      ++this.state.pos;
      this.finishToken(tt.semi);
      return;
    case charCodes.comma:
      ++this.state.pos;
      this.finishToken(tt.comma);
      return;

    // ...

    default:
      if (isIdentifierStart(code)) {
        this.readWord(code);
        return;
      }
  }
}
The core of isIdentifierStart is checking the Unicode code point of the character:
- if code is below uppercase A (65), only the $ sign returns true;
- if code is between A and Z (65-90), it returns true;
- if code is between uppercase Z and lowercase a (91-96), only the underscore _ returns true;
- if code is between a and z (97-122), it returns true;
- if code is at most 65535, it returns true only when it is at least 170 (0xaa) and matches the non-ASCII identifier-start pattern;
- if code is greater than 65535, isInAstralSet checks whether it is a valid identifier start.
export function isIdentifierStart(code: number): boolean {
  if (code < charCodes.uppercaseA) return code === charCodes.dollarSign;
  if (code <= charCodes.uppercaseZ) return true;
  if (code < charCodes.lowercaseA) return code === charCodes.underscore;
  if (code <= charCodes.lowercaseZ) return true;
  if (code <= 0xffff) {
    return (
      code >= 0xaa && nonASCIIidentifierStart.test(String.fromCharCode(code))
    );
  }
  return isInAstralSet(code, astralIdentifierStartCodes);
}
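Equivalent checks are published in the @babel/helper-validator-identifier package, so you can try them directly (a quick sketch):

const {
  isIdentifierStart,
  isIdentifierChar,
} = require('@babel/helper-validator-identifier');

isIdentifierStart('c'.codePointAt(0)); // true
isIdentifierStart('$'.codePointAt(0)); // true
isIdentifierStart('1'.codePointAt(0)); // false, a digit cannot start an identifier
isIdentifierChar('1'.codePointAt(0));  // true, but it can appear inside one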
Since our c can start an identifier, we enter readWord.
readWord, also in the Tokenizer, reads a complete keyword or identifier.
It first calls readWord1, which uses isIdentifierChar to decide whether the current character can be part of an identifier; as long as it can, pos is incremented and the scan moves forward.
In our example, the loop breaks at the space after const, state.pos ends up as 5, and word is const.
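To make those positions concrete (plain string arithmetic on our input):

const input = 'const a = 1';
input.codePointAt(5); // 32, the space: not an identifier character, so the loop stops
input.slice(0, 5);    // "const"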
readWord then determines the type of the word, i.e. whether it is a keyword. keywordTypes is a Map whose keys are keywords and symbols and whose values are token types (numbers); const is a keyword, and its value is 67. tokenLabelName is then called.
tokenLabelName gets the token label from the type; the token type for const is tt._const.
If word is not a keyword, readWord treats it as a user-defined identifier, such as a in this example.
// Read an identifier or keyword token. Will check for reserved
// words when necessary.
readWord(firstCode: number | void): void {
  const word = this.readWord1(firstCode);
  const type = keywordTypes.get(word);
  if (type !== undefined) {
    // We don't use word as state.value here because word is a dynamic string
    // while token label is a shared constant string
    this.finishToken(type, tokenLabelName(type));
  } else {
    this.finishToken(tt.name, word);
  }
}

readWord1(firstCode: number | void): string {
  this.state.containsEsc = false;
  let word = "";
  const start = this.state.pos;
  let chunkStart = this.state.pos;
  if (firstCode !== undefined) {
    this.state.pos += firstCode <= 0xffff ? 1 : 2;
  }
  while (this.state.pos < this.length) {
    const ch = this.codePointAtPos(this.state.pos);
    if (isIdentifierChar(ch)) {
      this.state.pos += ch <= 0xffff ? 1 : 2;
    } else if (ch === charCodes.backslash) {
      this.state.containsEsc = true;
      word += this.input.slice(chunkStart, this.state.pos);
      const escStart = this.state.pos;
      const identifierCheck =
        this.state.pos === start ? isIdentifierStart : isIdentifierChar;
      if (this.input.charCodeAt(++this.state.pos) !== charCodes.lowercaseU) {
        this.raise(this.state.pos, Errors.MissingUnicodeEscape);
        chunkStart = this.state.pos - 1;
        continue;
      }
      ++this.state.pos;
      const esc = this.readCodePoint(true);
      if (esc !== null) {
        if (!identifierCheck(esc)) {
          this.raise(escStart, Errors.EscapedCharNotAnIdentifier);
        }
        word += String.fromCodePoint(esc);
      }
      chunkStart = this.state.pos;
    } else {
      break;
    }
  }
  return word + this.input.slice(chunkStart, this.state.pos);
}
We then reach finishToken. Since a complete token has now been read, finishToken sets the token's end position, type, and value on the state.
finishToken(type: TokenType, val: any): void {
  this.state.end = this.state.pos;
  const prevType = this.state.type;
  this.state.type = type;
  this.state.value = val;
  if (!this.isLookahead) {
    this.state.endLoc = this.state.curPosition();
    this.updateContext(prevType);
  }
}
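After this first token, the relevant parts of this.state for our input look roughly like this (a simplified sketch, not the full State object):

// Sketch of this.state after tokenizing "const" in "const a = 1":
//   start: 0          (where the token starts)
//   end:   5          (where it ends)
//   pos:   5          (current scan position)
//   type:  tt._const  (token type looked up in keywordTypes)
//   value: "const"    (the shared token label)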
Going back to the parse method of the Parser, the call chain is:
- parseTopLevel =>
- parseProgram =>
- parseBlockBody =>
- parseBlockOrModuleBlockBody =>
- parseStatement =>
- parseStatementContent, which is where statements are actually parsed.
parseBlockOrModuleBlockBody(
  body: N.Statement[],
  directives: ?(N.Directive[]),
  topLevel: boolean,
  end: TokenType,
  afterBlockParse?: (hasStrictModeDirective: boolean) => void,
): void {
  const oldStrict = this.state.strict;
  let hasStrictModeDirective = false;
  let parsedNonDirective = false;

  while (!this.match(end)) {
    const stmt = this.parseStatement(null, topLevel);

    if (directives && !parsedNonDirective) {
      if (this.isValidDirective(stmt)) {
        const directive = this.stmtToDirective(stmt);
        directives.push(directive);

        if (
          !hasStrictModeDirective &&
          directive.value.value === "use strict"
        ) {
          hasStrictModeDirective = true;
          this.setStrict(true);
        }

        continue;
      }
      parsedNonDirective = true;
      // clear strict errors since the strict mode will not change within the block
      this.state.strictErrors.clear();
    }
    body.push(stmt);
  }

  if (afterBlockParse) {
    afterBlockParse.call(this, hasStrictModeDirective);
  }

  if (!oldStrict) {
    this.setStrict(false);
  }

  this.next();
}
parseStatementContent parses different declarations depending on starttype, such as function/do/while/break; in our case const takes us into parseVarStatement.
export default class StatementParser extends ExpressionParser {
  parseStatement(context: ?string, topLevel?: boolean): N.Statement {
    if (this.match(tt.at)) {
      this.parseDecorators(true);
    }
    return this.parseStatementContent(context, topLevel);
  }

  parseStatementContent(context: ?string, topLevel: ?boolean): N.Statement {
    let starttype = this.state.type;
    const node = this.startNode();
    let kind;

    if (this.isLet(context)) {
      starttype = tt._var;
      kind = "let";
    }

    // Most types of statements are recognized by the keyword they
    // start with. Many are trivial to parse, some require a bit of
    // complexity.
    switch (starttype) {
      case tt._break:
        return this.parseBreakContinueStatement(node, /* isBreak */ true);
      case tt._continue:
        return this.parseBreakContinueStatement(node, /* isBreak */ false);
      case tt._debugger:
        return this.parseDebuggerStatement(node);
      case tt._do:
        return this.parseDoStatement(node);
      case tt._for:
        return this.parseForStatement(node);
      case tt._function:
      // ...
      case tt._const:
      case tt._var:
        kind = kind || this.state.value;
        if (context && kind !== "var") {
          this.raise(this.state.start, Errors.UnexpectedLexicalDeclaration);
        }
        return this.parseVarStatement(node, kind);
      // ...
    }
  }

  parseVarStatement(
    node: N.VariableDeclaration,
    kind: "var" | "let" | "const",
  ): N.VariableDeclaration {
    this.next();
    this.parseVar(node, false, kind);
    this.semicolon();
    return this.finishNode(node, "VariableDeclaration");
  }
}
parseVarStatement in turn calls next => nextToken to parse the next token, in this case a.
next(): void {
  this.checkKeywordEscapes();
  if (this.options.tokens) {
    this.pushToken(new Token(this.state));
  }

  this.state.lastTokEnd = this.state.end;
  this.state.lastTokStart = this.state.start;
  this.state.lastTokEndLoc = this.state.endLoc;
  this.state.lastTokStartLoc = this.state.startLoc;
  this.nextToken();
}
parseVarStatement then goes into parseVar, which creates a new node for the variable declarator and then calls eat.
parseVar(
  node: N.VariableDeclaration,
  isFor: boolean,
  kind: "var" | "let" | "const",
): N.VariableDeclaration {
  const declarations = (node.declarations = []);
  const isTypescript = this.hasPlugin("typescript");
  node.kind = kind;

  for (;;) {
    const decl = this.startNode();
    this.parseVarId(decl, kind);
    if (this.eat(tt.eq)) {
      decl.init = isFor
        ? this.parseMaybeAssignDisallowIn()
        : this.parseMaybeAssignAllowIn();
    } else {
      if (
        kind === "const" &&
        !(this.match(tt._in) || this.isContextual(tt._of))
      ) {
        // `const` with no initializer is allowed in TypeScript.
        // It could be a declaration like `const x: number;`.
        if (!isTypescript) {
          this.raise(
            this.state.lastTokEnd,
            Errors.DeclarationMissingInitializer,
            "Const declarations",
          );
        }
      } else if (
        decl.id.type !== "Identifier" &&
        !(isFor && (this.match(tt._in) || this.isContextual(tt._of)))
      ) {
        this.raise(
          this.state.lastTokEnd,
          Errors.DeclarationMissingInitializer,
          "Complex binding patterns",
        );
      }
      decl.init = null;
    }
    declarations.push(this.finishNode(decl, "VariableDeclarator"));
    if (!this.eat(tt.comma)) break;
  }
  return node;
}
The eat method lives in the Tokenizer. Since the current token is =, eat(tt.eq) matches and calls next => nextToken => getTokenFromCode => readNumber, which tokenizes the next character, 1, and then returns true, so we enter parseMaybeAssignAllowIn (isFor is false here).
parseMaybeAssignAllowIn is not expanded here; it eventually produces a Node of type NumericLiteral, which is assigned to decl.init.
Then declarations.push(this.finishNode(decl, "VariableDeclarator")) finishes the declarator and pushes it into the declarations array.
eat(type: TokenType): boolean {
  if (this.match(type)) {
    this.next();
    return true;
  } else {
    return false;
  }
}
Back in parseVarStatement, this.finishNode(node, "VariableDeclaration") is finally called to finish parsing the node.
Back in parseBlockOrModuleBlockBody, the parsed stmt is pushed into body, and next => nextToken is called again; at this point state.pos is 11 and this.length is 11, so finishToken(tt.eof) is called (the relevant branch of nextToken is shown again below).
Back in parseProgram, finishNode(program, "Program") is called to finish parsing the Program node.
Back in parseTopLevel, this.finishNode(file, "File") is called to finish parsing the File node.
nextToken(): void {
  const curContext = this.curContext();
  if (!curContext.preserveSpace) this.skipSpace();
  this.state.start = this.state.pos;
  if (!this.isLookahead) this.state.startLoc = this.state.curPosition();
  if (this.state.pos >= this.length) {
    this.finishToken(tt.eof);
    return;
  }
  if (curContext === ct.template) {
    this.readTmplToken();
  } else {
    this.getTokenFromCode(this.codePointAtPos(this.state.pos));
  }
}
To sum up, @babel/parser parses const a = 1 through the following steps:
- create a Node of type File, temporarily named NodeA;
- create a Node of type Program, named NodeB;
- parse a token and obtain const;
- enter the parser, go into parseVarStatement, and create a VariableDeclaration Node, named NodeC;
- call next to parse one more token, a;
- enter parseVar and create a VariableDeclarator Node, named NodeD;
- call eat; because the next token is =, next is called again to parse the next character, 1;
- enter parseMaybeAssignAllowIn and obtain a Node of type NumericLiteral, named NodeE;
- assign NodeE to the init property of NodeD;
- push NodeD into the declarations property of NodeC;
- push NodeC into the body property of NodeB;
- assign NodeB to the program property of NodeA;
- return the Node of type File.
The AST for the simple example const a = 1 (omitting loc, errors, comments, and so on) is shown below. You can also explore it in AST Explorer.
{
  "type": "File",
  "start": 0,
  "end": 11,
  "program": {
    "type": "Program",
    "start": 0,
    "end": 11,
    "sourceType": "module",
    "interpreter": null,
    "body": [
      {
        "type": "VariableDeclaration",
        "start": 0,
        "end": 11,
        "declarations": [
          {
            "type": "VariableDeclarator",
            "start": 6,
            "end": 11,
            "id": { "type": "Identifier", "start": 6, "end": 7, "name": "a" },
            "init": {
              "type": "NumericLiteral",
              "start": 10,
              "end": 11,
              "extra": { "rawValue": 1, "raw": "1" },
              "value": 1
            }
          }
        ],
        "kind": "const"
      }
    ]
  }
}
3. Flow chart
Below is a flow chart for further understanding.
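In outline, the call flow traced above is (a rough sketch, not every intermediate call):

// parse()
//   -> enterInitialScopes()
//   -> startNode()   creates the File and Program nodes
//   -> nextToken()   -> getTokenFromCode() / readTmplToken() -> finishToken()
//   -> parseTopLevel()
//        -> parseProgram() -> parseBlockBody() -> parseBlockOrModuleBlockBody()
//             -> parseStatement() -> parseStatementContent()
//                  -> parseVarStatement()
//                       -> next() -> nextToken()
//                       -> parseVar() -> parseVarId() -> eat(tt.eq) -> parseMaybeAssignAllowIn()
//                       -> finishNode(node, "VariableDeclaration")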
V. Summary
From the above example, it can be seen that:
- Lexical analysis and syntax analysis are not as cleanly separated as many articles describe. The parser does not first produce all the tokens and then turn them into an AST; instead, syntax analysis starts as soon as the first token is obtained, so Node and Token are maintained at the same time, and nextToken is called again and again to get the next token.
- startNode/finishNode and startToken/finishToken come in pairs and can be understood as a stack, or an onion model: for every Node that is entered, finishNode is eventually called to assign its type, end, and other attributes; tokens work the same way.
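One way to see the two products side by side is the tokens option of parse, which keeps the token list on the returned File node alongside the AST (a quick sketch; the token objects contain more fields than shown here):

const { parse } = require('@babel/parser');

// tokens: true keeps the token list on the returned File node.
const file = parse('const a = 1', { tokens: true });
file.program.body[0].type; // "VariableDeclaration"
file.tokens;               // the tokens produced along the way: const, a, =, 1, ...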
This article is only meant to serve as a starting point. If you are interested in @babel/parser, dig deeper into the source code, and feel free to discuss any questions.
VI. Series of articles
- Babel basics
- @babel/core
- @babel/parser (this article)
- @babel/traverse
- @babel/generator
VII. Related materials
- the-super-tiny-compiler
- babel-handbook
- loose-mode