This article is a translation. It tries to render the core terms of the ECMAScript specification accurately and is offered to peers for review.

v8.dev/blog/unders…

This time we take a closer look at how the ECMAScript language and its syntax are defined. If you are not familiar with context-free grammars, now is a good time to review the basic concepts, because the specification uses context-free grammars to define the language.

ECMAScript grammar

The ECMAScript specification defines four grammars.

  1. Lexical grammar: Describes how Unicode code points are translated into a sequence of input elements (tokens, line terminators, comments, white space).
  2. Syntactic grammar: Defines how tokens make up syntactically correct programs.
  3. RegExp grammar: Describes how Unicode code points are translated into regular expressions.
  4. Numeric string grammar: Describes how strings are translated into numeric values.

Each grammar is defined by a context-free grammar and contains a set of productions.

The grammars use slightly different notations. The syntactic grammar is written as LeftHandSideSymbol :, the lexical grammar and the RegExp grammar as LeftHandSideSymbol ::, and the numeric string grammar as LeftHandSideSymbol :::. (That is, they are told apart by the number of colons. Translator's note.)

Let's take a closer look at the lexical grammar and the syntactic grammar.

The lexical grammar

The specification defines ECMAScript source text as a sequence of Unicode code points. This means, for example, that variable names are not limited to ASCII characters but can also contain other Unicode characters. The specification does not talk about the actual encoding (such as UTF-8 or UTF-16); it assumes that the source code has already been converted to a sequence of Unicode code points according to whatever encoding it was in.

ECMAScript source code cannot be tokenized in advance, which makes defining the lexical grammar a little more complicated. For example, it is impossible to determine whether / is the division operator or the start of a regular expression without looking at the larger context in which it occurs.

const x = 10 / 5;


Here / is a DivPunctuator.

const r = /foo/;


Here the first / is the start of a RegularExpressionLiteral.

Templates introduce a similar ambiguity: the interpretation of }` depends on the context in which it occurs:

const what1 = 'temp';
const what2 = 'late';
const t = `I am a ${ what1 + what2 }`;

Here `I am a ${ is a TemplateHead and }` is a TemplateTail.

if (0 == 1) {
}`not very useful`;

Here } is a RightBracePunctuator and ` is the start of a NoSubstitutionTemplate.

Even though the interpretation of / and }` depends on their context (that is, on their position in the syntactic structure of the code), the grammars described below are still context-free.

The lexical grammar uses several goal symbols to distinguish the contexts in which some input elements are permitted and others are not. For example, the goal symbol InputElementDiv (Div stands for division) is used in contexts where / is a division and /= is a division assignment. The InputElementDiv productions list the tokens that can be produced in this context:

InputElementDiv ::
  WhiteSpace
  LineTerminator
  Comment
  CommonToken
  DivPunctuator
  RightBracePunctuator


In this context, encountering / produces a DivPunctuator input element. Producing a RegularExpressionLiteral is not an option here.

Correspondingly, for contexts where / is the start of a regular expression, the goal symbol is InputElementRegExp:

InputElementRegExp ::
  WhiteSpace
  LineTerminator
  Comment
  CommonToken
  RightBracePunctuator
  RegularExpressionLiteral


These productions can produce a RegularExpressionLiteral input element, but not a DivPunctuator.

Similarly, there is another goal symbol, InputElementRegExpOrTemplateTail, for contexts where TemplateMiddle and TemplateTail are permitted in addition to RegularExpressionLiteral. Finally, InputElementTemplateTail is the goal symbol for contexts where only TemplateMiddle and TemplateTail are permitted, but RegularExpressionLiteral is not.

In implementations, the syntactic grammar analyzer (the "parser") may call the lexical grammar analyzer (the "tokenizer" or "lexer"), passing the goal symbol as a parameter and asking for the next input element that fits that goal symbol.
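The hand-off between parser and tokenizer can be sketched roughly as follows. This is a toy, not how any real engine is written: the function name nextSlashElement and the shape of its return value are made up for illustration, it only handles input elements that begin with /, and its regular expression scanning ignores character classes such as [/].

```javascript
// Toy tokenizer step: the caller (the "parser") passes the goal symbol,
// and the same character / yields a different input element depending on it.
function nextSlashElement(source, pos, goal) {
  if (source[pos] !== "/") {
    throw new Error("this sketch only handles input elements starting with /");
  }
  if (goal === "InputElementDiv") {
    // In this context / can only be a division or a division assignment.
    const text = source[pos + 1] === "=" ? "/=" : "/";
    return { type: "DivPunctuator", text };
  }
  if (goal === "InputElementRegExp") {
    // Scan forward to the closing slash, skipping escaped characters.
    let end = pos + 1;
    while (end < source.length && source[end] !== "/") {
      if (source[end] === "\\") end += 1;
      end += 1;
    }
    if (end >= source.length) {
      throw new SyntaxError("unterminated regular expression");
    }
    end += 1; // include the closing slash
    while (end < source.length && /[a-z]/i.test(source[end])) end += 1; // flags
    return { type: "RegularExpressionLiteral", text: source.slice(pos, end) };
  }
  throw new Error(`unsupported goal symbol: ${goal}`);
}

console.log(nextSlashElement("10 / 5", 3, "InputElementDiv"));
// { type: 'DivPunctuator', text: '/' }
console.log(nextSlashElement("/foo/;", 0, "InputElementRegExp"));
// { type: 'RegularExpressionLiteral', text: '/foo/' }
```

The point of the sketch is only that the decision is made by the caller: the tokenizer never has to guess which meaning of / applies.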

The syntactic grammar

The lexical grammar defines how tokens are built from Unicode code points. The syntactic grammar builds on top of it: it defines how tokens make up syntactically correct programs.

Example: Allow legacy identifiers

Adding a new keyword to a grammar can be disruptive: What if the existing code already uses that keyword as an identifier?

For example, before await was a keyword, someone might have written code like this:

function old() {
  var await;
}


The ECMAScript grammar carefully adds the await keyword so that the code can continue to work. In async functions, await is a keyword, so we cannot say:

async function modern() {
  var await; // Syntax error
}

Allowing yield as an identifier in non-generators and disallowing it in generators works in a similar way.
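One way to observe this behaviour is to ask the engine to parse small snippets without running them. The sketch below uses the Function constructor for that purpose; the helper name parses is made up for this example, and the snippets are parsed as non-strict code (in strict mode yield is reserved everywhere).

```javascript
// Returns true if `source` parses as a function body, false on a SyntaxError.
// The function created by `new Function` is never called, so nothing runs.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected
  }
}

const results = {
  awaitInFunction: parses("function old() { var await; }"),
  awaitInAsyncFunction: parses("async function modern() { var await; }"),
  yieldInFunction: parses("function old() { var yield; }"),
  yieldInGenerator: parses("function* modern() { var yield; }"),
};

// In a spec-conforming engine the two "old" cases are true
// and the async and generator cases are false.
console.log(results);
```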

Understanding how await is allowed as an identifier requires understanding the ECMAScript-specific grammar notation.

Productions and shorthands

Let’s see how the production of a VariableStatement is defined. At first glance, this grammar is a little scary:

VariableStatement[Yield, Await] :
  var VariableDeclarationList[+In, ?Yield, ?Await] ;

What do the subscripts ([Yield, Await]) and the prefixes (the + in +In and the ? in ?Await) mean?

This notation is explained in the Grammar Notation section of the specification.

The subscripts are a shorthand for expressing a whole set of productions, for a set of left-hand side symbols, all at once. Here the left-hand side symbol has two parameters, so it expands into four "real" left-hand side symbols:

  • VariableStatement
  • VariableStatement_Yield
  • VariableStatement_Await
  • VariableStatement_Yield_Await

Note that the plain VariableStatement above means exactly VariableStatement, without _Await and without _Yield. It should not be confused with the shorthand VariableStatement[Yield, Await].

On the right-hand side of the production we see the shorthand +In, which means "use the version with _In", and ?Await, which means "use the version with _Await if and only if the left-hand side symbol has _Await" (?Yield works the same way).

The third shorthand, ~Foo, means “use the version without _Foo” (not present in this production).

Knowing this, we can expand the production above like this:

VariableStatement :
  var VariableDeclarationList_In ;

VariableStatement_Yield :
  var VariableDeclarationList_In_Yield ;

VariableStatement_Await :
  var VariableDeclarationList_In_Await ;

VariableStatement_Yield_Await :
  var VariableDeclarationList_In_Yield_Await ;
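The expansion rules for +, ? and ~ are mechanical, so they can be sketched in a few lines of code. The function below is only an illustration: the name expandProduction and the input format (terminals as strings, nonterminals as { name, args } objects) are assumptions made for this example and do not come from the specification.

```javascript
// Expands one parameterized production into its concrete productions.
// lhs: { name, params }, e.g. params = ["Yield", "Await"]
// rhs: array of terminals (strings) and nonterminals ({ name, args })
function expandProduction(lhs, rhs) {
  const suffix = (names) => names.map((name) => "_" + name).join("");
  const expansions = [];
  const combinations = 1 << lhs.params.length; // every subset of the parameters
  for (let mask = 0; mask < combinations; mask++) {
    const active = lhs.params.filter((_, index) => mask & (1 << index));
    const right = rhs.map((symbol) => {
      if (typeof symbol === "string") return symbol; // terminal, kept as is
      const kept = symbol.args
        .filter((arg) => {
          const prefix = arg[0];
          if (prefix === "+") return true; // always use the version with it
          if (prefix === "~") return false; // always use the version without it
          return active.includes(arg.slice(1)); // "?": copy from the left side
        })
        .map((arg) => arg.slice(1));
      return symbol.name + suffix(kept);
    });
    expansions.push(`${lhs.name}${suffix(active)} : ${right.join(" ")}`);
  }
  return expansions;
}

const variableStatement = expandProduction(
  { name: "VariableStatement", params: ["Yield", "Await"] },
  ["var", { name: "VariableDeclarationList", args: ["+In", "?Yield", "?Await"] }, ";"]
);
console.log(variableStatement.join("\n"));
```

Run on the VariableStatement production, this prints the same four productions as the hand-written expansion, each on a single line.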

Now we need to find out two things:

  1. Where is it decided whether we are in the case with _Await or without _Await?
  2. Where does it make a difference, that is, where do the productions for Something_Await and Something (without _Await) diverge?

_Await or no _Await?

Let's tackle the first question. It is fairly easy to guess that async and non-async functions differ in whether the function body is parameterized with _Await or not. Reading the productions for async function declarations, we find this:

AsyncFunctionBody :
  FunctionBody[~Yield, +Await]

Note that AsyncFunctionBody has no parameters; they are added to FunctionBody on the right-hand side. Expanding this production gives:

AsyncFunctionBody :
  FunctionBody_Await

In other words, async functions have a FunctionBody_Await, meaning a function body in which await is treated as a keyword.

On the other hand, in a non-async function, the relevant production is:

FunctionDeclaration[Yield, Await, Default] :
  function BindingIdentifier[?Yield, ?Await] ( FormalParameters[~Yield, ~Await] ) { FunctionBody[~Yield, ~Await] }

(FunctionDeclaration has another production, but it is not relevant to our code example.)

To avoid a combinatorial explosion, we ignore the Default parameter, which is not used in this particular production. The expanded form of the production is:

FunctionDeclaration :
  function BindingIdentifier ( FormalParameters ) { FunctionBody }

FunctionDeclaration_Yield :
  function BindingIdentifier_Yield ( FormalParameters ) { FunctionBody }

FunctionDeclaration_Await :
  function BindingIdentifier_Await ( FormalParameters ) { FunctionBody }

FunctionDeclaration_Yield_Await :
  function BindingIdentifier_Yield_Await ( FormalParameters ) { FunctionBody }

In this production we always get FunctionBody and FormalParameters (without _Yield and without _Await), because both are parameterized with [~Yield, ~Await] in the unexpanded production.

The function name is treated differently: it gets the parameters _Yield and _Await if the left-hand side symbol has them.

To summarize: async functions have a FunctionBody_Await, while non-async functions have a FunctionBody (without _Await). Since we are talking about non-generator functions, both the async and the non-async example function are parameterized without _Yield.

It may be difficult to remember which is FunctionBody and which is FunctionBody_Await. In a function body with FunctionBody_Await, is await an identifier or a keyword?

You can think of the _Await parameter as meaning "await is a keyword". This reading is also future-proof. Imagine that a new keyword, blob, is added, but only inside "blobby" functions. Non-blobby, non-async, non-generator functions would still have FunctionBody (without _Yield, _Await or _Blob), exactly as they do now. Async blobby functions would have FunctionBody_Await_Blob, and so on. The Blob subscript would still need to be added to the productions, but the expanded forms of FunctionBody for already existing functions would stay the same.

Disallowing await as an identifier

Next, we need to figure out why await is not allowed as an identifier in FunctionBody_Await.

If we follow the productions further, we see that the _Await parameter is carried unchanged from FunctionBody all the way to the VariableStatement production we looked at earlier. Thus, inside an async function we have a VariableStatement_Await, and inside a non-async function we have a VariableStatement.

We can follow the productions further and keep track of the parameters. We have already seen the production for VariableStatement:

VariableStatement[Yield, Await] :
  var VariableDeclarationList[+In, ?Yield, ?Await] ;

All productions for VariableDeclarationList simply pass the parameters on as they are:

VariableDeclarationList[In, Yield, Await] :
  VariableDeclaration[?In, ?Yield, ?Await]

(Only the production that is relevant to our example is shown here.)

VariableDeclaration[In, Yield, Await] :
  BindingIdentifier[?Yield, ?Await] Initializer[?In, ?Yield, ?Await]opt

The opt shorthand means that the symbol it follows is optional. There are in fact two productions: one with the optional symbol (Initializer) and one without it.
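As a rough illustration of what opt stands for, the sketch below expands a right-hand side containing optional symbols into all of its alternatives. The function name expandOptional and the { symbol, optional } input shape are assumptions made for this example.

```javascript
// Expands a right-hand side with optional symbols into every alternative.
// rhs: array of { symbol, optional } entries, in source order.
function expandOptional(rhs) {
  let alternatives = [[]];
  for (const { symbol, optional } of rhs) {
    const next = [];
    for (const alternative of alternatives) {
      if (optional) next.push(alternative); // the version without the symbol
      next.push([...alternative, symbol]); // the version with the symbol
    }
    alternatives = next;
  }
  return alternatives.map((alternative) => alternative.join(" "));
}

console.log(
  expandOptional([
    { symbol: "BindingIdentifier", optional: false },
    { symbol: "Initializer", optional: true },
  ])
);
// [ 'BindingIdentifier', 'BindingIdentifier Initializer' ]
```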

In the simple case relevant to our example, the VariableStatement consists of the keyword var, followed by a single BindingIdentifier without an Initializer, and ends with a semicolon.

To disallow or allow await as a BindingIdentifier, we would hope to end up with something like this:

BindingIdentifier_Await :
  Identifier
  yield

BindingIdentifier :
  Identifier
  yield
  await

This would disallow await as an identifier inside async functions and allow it as an identifier inside non-async functions.

But the specification does not define it like this. Instead, we find this production:

BindingIdentifier[Yield, Await] :
  Identifier
  yield
  await

After expansion, we get:

BindingIdentifier_Await :
  Identifier
  yield
  await

BindingIdentifier :
  Identifier
  yield
  await

(The productions for BindingIdentifier_Yield and BindingIdentifier_Yield_Await are omitted because our example does not need them.)

This looks as if await and yield were always allowed as identifiers. What is going on here? Was this whole article written for nothing?

Static semantics

It turns out that static semantics are also needed in order to disallow await as an identifier in asynchronous functions.

Static semantics describe static rules, that is, rules to be validated before a program is run.

For our example, the static semantics for BindingIdentifier define the following syntax-directed rule:

BindingIdentifier[Yield, Await] : await

  • It is a Syntax Error if this production has an [Await] parameter.

In effect, this forbids the BindingIdentifier_Await : await production.

The specification explains that the reason for having this production, yet defining it as a Syntax Error in the static semantics, is interference with automatic semicolon insertion (ASI).

Remember that ASI kicks in when a line of code cannot be parsed according to the grammar productions. ASI tries to add semicolons to satisfy the requirement that statements and declarations must end with a semicolon. (ASI is described in more detail in a later article.)

Take a look at the following code (an example from the specification) :

async function too_few_semicolons() {
  let
  await 0;
}

If the grammar did not allow await as an identifier, ASI would kick in and transform the code into the following grammatically correct code, which also uses let as an identifier:

async function too_few_semicolons() {
  let;
  await 0;
}

This kind of interference with ASI was considered too confusing, so static semantics were used to disallow await as an identifier instead.

Disallowed StringValues of identifiers

Here’s another related rule:

BindingIdentifier : Identifier

  • It is a Syntax Error if this production has an [Await] parameter and the StringValue of Identifier is "await".

This may be confusing at first. Identifier is defined as follows:

Identifier :
  IdentifierName but not ReservedWord

await is a ReservedWord, so how can an Identifier ever be await?

As it turns out, an Identifier cannot be await, but it can be something else whose StringValue is "await": a different representation of the character sequence await.

The static semantics for identifier names define how the StringValue of an identifier name is computed. For example, the Unicode escape sequence for a is \u0061, so the StringValue of \u0061wait is "await". \u0061wait is not recognized as a keyword by the lexical grammar; instead, it is an Identifier. The static semantics forbid using it as a variable name inside async functions.

So this works:

function old() {
  var \u0061wait;
}

But this won’t do:

async function modern() {
  var \u0061wait; // Syntax error
}
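The rule can be pictured with a small sketch that computes a StringValue by decoding Unicode escapes and then applies the check. Both function names, stringValue and violatesAwaitRule, are made up for this illustration; the specification expresses the same idea as static semantics rather than as functions, and engines typically do the decoding inside the tokenizer.

```javascript
// Toy StringValue: decode \uXXXX and \u{X...} escapes in an identifier name.
function stringValue(identifierName) {
  return identifierName.replace(
    /\\u\{([0-9a-fA-F]+)\}|\\u([0-9a-fA-F]{4})/g,
    (match, braced, fourDigits) =>
      String.fromCodePoint(parseInt(braced || fourDigits, 16))
  );
}

// Hypothetical check mirroring the rule for BindingIdentifier : Identifier.
// hasAwaitParameter is true inside async functions.
function violatesAwaitRule(identifierName, hasAwaitParameter) {
  return hasAwaitParameter && stringValue(identifierName) === "await";
}

const escaped = "\\u0061wait"; // the source text \u0061wait
console.log(stringValue(escaped)); // await
console.log(violatesAwaitRule(escaped, false)); // false
console.log(violatesAwaitRule(escaped, true)); // true
```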

Summary

In this article, we became familiar with the lexical grammar, the syntactic grammar, and the shorthands used to define the syntactic grammar. As an example, we looked at how await is disallowed as an identifier inside async functions but allowed inside non-async functions.

Other interesting parts of the syntactic grammar, such as automatic semicolon insertion (ASI) and cover grammars, will be covered in a later article.