• Column address: Front-end compilation and engineering
  • Jouryjc

Getting to know Peg.js

What?

Peg.js is a simple JavaScript parser generator that produces fast parsers with excellent error reporting. You can use it to work with complex data or computer languages and easily build converters, interpreters, compilers, and other tools.

Let’s start with a quick look at the definitions of interpreters and compilers:

An interpreter is a computer program that executes source code directly, acting as a middleman between the program and the machine. It translates the program line by line as it runs, so programs that rely on an interpreter tend to run slowly. The advantage is that the whole program does not have to be recompiled after each change, which reduces the cost of updating it.

A compiler is a computer program that converts the source code written in one programming language (the source language) into another programming language (the target language).

The main differences between the two are:

  • A compiler translates a program as a whole, while an interpreter translates line by line;
  • A compiler generates intermediate or object code, while an interpreter does not;
  • Compiled programs generally run faster than interpreted ones, because the compiler translates the whole program ahead of time, whereas the interpreter translates each line as it executes it;
  • A compiler needs more memory than an interpreter, because it has to generate object code;
  • When a compiler hits an error, it stops translating, and the entire program is retranslated once the error is fixed. When an interpreter hits an error, it stops at that point, and translation continues from there once the error is fixed;
  • Compilers are typically used for languages such as C, C++, C#, and Scala, while interpreters are typically used for languages such as PHP, Ruby, Python, and JavaScript.
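
To make the distinction concrete, here is a minimal sketch in plain JavaScript. It assumes a hypothetical tree shape for the expression 2 * (3 + 4); the node format and the function names interpret and compile are illustrative only and have nothing to do with Peg.js itself. The interpreter computes the value while walking the tree, whereas the compiler only produces target code (here, JavaScript source text) to be run later.

```javascript
// Hypothetical expression tree for 2 * (3 + 4).
const ast = {
  op: '*',
  left: { num: 2 },
  right: { op: '+', left: { num: 3 }, right: { num: 4 } },
};

// Interpreter: walks the tree and computes the value directly.
function interpret(node) {
  if ('num' in node) return node.num;
  const l = interpret(node.left);
  const r = interpret(node.right);
  switch (node.op) {
    case '+': return l + r;
    case '-': return l - r;
    case '*': return l * r;
    case '/': return l / r;
    default: throw new Error('Unknown operator: ' + node.op);
  }
}

// Compiler: translates the tree into target-language source text.
function compile(node) {
  if ('num' in node) return String(node.num);
  return '(' + compile(node.left) + ' ' + node.op + ' ' + compile(node.right) + ')';
}

const interpreted = interpret(ast);
const targetCode = compile(ast);

console.log(interpreted); // 14
console.log(targetCode);  // (2 * (3 + 4))
console.log(new Function('return ' + targetCode)()); // 14
```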

How?

Peg.js can be used in both Node.js and the browser, and it is installed like any other package:

# Install globally to generate the parser with the CLI
npm install -g pegjs

# Install locally to generate the parser through the JavaScript API, since you import the package in code
npm install pegjs

This article only demonstrates generating the parser with the CLI. The JavaScript API is described in the official documentation and accepts the same options. Create a new file named simple-arithmetics.pegjs and fill in the rules from the official demo:

// Simple Arithmetics Grammar
// ==========================
//
// Accepts expressions like "2 * (3 + 4)" and computes their value.

Expression
  = head:Term tail:(_ ("+" / "-") _ Term)* {
      return tail.reduce(function(result, element) {
        if (element[1] === "+") { return result + element[3]; }
        if (element[1] === "-") { return result - element[3]; }
      }, head);
    }

Term
  = head:Factor tail:(_ ("*" / "/") _ Factor)* {
      return tail.reduce(function(result, element) {
        if (element[1] === "*") { return result * element[3]; }
        if (element[1] === "/") { return result / element[3]; }
      }, head);
    }

Factor
  = "(" _ expr:Expression _ ")" { return expr; }
  / Integer

Integer "integer"
  = _ [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

Then execute the following command:

pegjs simple-arithmetics.pegjs

This generates a simple-arithmetics.js file in the current directory:

/*
 * Generated by PEG.js 0.10.0.
 *
 * http://pegjs.org/
 */

"use strict";

function peg$subclass(child, parent) {
  function ctor() { this.constructor = child; }
  ctor.prototype = parent.prototype;
  child.prototype = new ctor();
}

function peg$SyntaxError(message, expected, found, location) {
  // ...
}

peg$subclass(peg$SyntaxError, Error);

peg$SyntaxError.buildMessage = function(expected, found) {
   // ...
};

function peg$parse(input, options) {
  // ...
}

module.exports = {
  SyntaxError: peg$SyntaxError,
  parse:       peg$parse
};


The generated file exports parse and SyntaxError as a CommonJS module. Let’s create a new test.js file that loads the parser and parses our expression:

const { parse } = require('./simple-arithmetics')
console.log(parse('2 * (3 + 4)'))  // 14
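
To see what the action in the Expression rule does with its match results, here is a minimal sketch that runs the same reduce outside the parser. The head and tail values are written by hand to approximate what the rule would receive for the input "2 + 3 - 4"; they are an assumption for illustration, not output captured from the generated parser. Each tail element mirrors the sequence (_ ("+" / "-") _ Term): whitespace, operator, whitespace, term.

```javascript
// Hand-written approximation of the match results for "2 + 3 - 4".
const head = 2;
const tail = [
  [[' '], '+', [' '], 3], // index 1 is the operator, index 3 is the term
  [[' '], '-', [' '], 4],
];

// Same folding logic as the action of the Expression rule.
const value = tail.reduce(function (result, element) {
  if (element[1] === '+') { return result + element[3]; }
  if (element[1] === '-') { return result - element[3]; }
}, head);

console.log(value); // 1
```

The fold runs left to right starting from head, which is why the operators come out left-associative: (2 + 3) - 4.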

At this point, a parser that evaluates simple arithmetic expressions is complete. Before we dive into the grammar syntax, let’s look at the options the CLI accepts when generating a parser:

--allowed-start-rules

By default, parsing starts from the first rule in the grammar. This option lists the rules the parser is allowed to start from; on the CLI it is a comma-separated list of rule names. For example, take the following grammar:

middle
  = end '*'

start
  = [a-z] middle

end
  = [1-9]

If we generate the parser without passing --allowed-start-rules, that is, with the following command:

pegjs ./simple-arithmetics.pegjs

The generated parser will use middle as its entry rule. Let’s test it:

const { parse } = require('./simple-arithmetics')
console.log(parse('1*'))   // ['1', '*']
console.log(parse('a1*'))  // peg$SyntaxError: Expected [1-9] but "a" found.

As you can see, parsing starts from the first rule in the file (middle). To get the order we actually want (start, then middle, then end), add the --allowed-start-rules option and specify start:

pegjs --allowed-start-rules start ./simple-arithmetics.pegjs

The regenerated parser handles the same inputs as follows:

const { parse } = require('./simple-arithmetics')
// ⚠️ parse throws an error on failure, so the valid input goes first
console.log(parse('a1*'))  // [ 'a', [ '1', '*' ] ]
console.log(parse('1*'))   // peg$SyntaxError: Expected [a-z] but "1" found.

--cache

Makes the generated parser cache results. This avoids exponential parsing time in pathological cases, at the cost of making the parser slower otherwise.

--dependency

Declares an external dependency of the parser. For example, --dependency ast:./ast.js makes the generated parser load ./ast.js under the name ast, so that actions in the grammar can call any method the module exports.

--export-var

The name of the global variable to which the parser object is assigned when no module loader is detected.

--extra-options

Additional options, in JSON format, to pass to the peg.generate function.

--extra-options-file

When there are many options, typing them on the command line is inconvenient and hard to read. This option instead names a JSON file whose contents are passed to peg.generate as additional options.

--format

Specifies the format of the generated parser: amd, commonjs, globals, or umd (default: commonjs).

--optimize

Chooses whether the generated parser is optimized for parsing speed (speed) or code size (size) (default: speed).

--plugin

Makes Peg.js use the specified plugin.

--trace

Enables tracing in the generated parser, so that it reports detailed progress while parsing.

These options are not needed often, so a brief overview is enough.

Syntax and semantics

Let’s go back to the official arithmetic grammar to understand the syntax and semantics of a grammar and how some of its expressions are used.

// Simple Arithmetics Grammar
// ==========================
//
// Accepts expressions like "2 * (3 + 4)" and computes their value.

Expression
  = head:Term tail:(_ @("+" / "-") _ @Term)* {
      return tail.reduce(function(result, element) {
        if (element[0] === "+") return result + element[1];
        if (element[0] === "-") return result - element[1];
      }, head);
    }

Term
  = head:Factor tail:(_ @("*" / "/") _ @Factor)* {
      return tail.reduce(function(result, element) {
        if (element[0] === "*") return result * element[1];
        if (element[0] === "/") return result / element[1];
      }, head);
    }

Factor
  = "(" _ @Expression _ ")"
  / Integer

Integer "integer"
  = _ [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

The grammar defines five rules, each with a name (for example, Expression) and a parsing expression. When the input text matches the expression, the JavaScript action that follows it is executed. A rule such as Integer "integer" also carries a human-readable name that is used in error messages. What does that mean? Here’s an example:

middle "middle"
  = end '*'

start
  = [a-z] middle

end
  = [1-9]

Generate a parser from these rules, again with start as the start rule, and parse "a1!". The error message we get is:

peg$SyntaxError: Expected middle but "1" found.

The "Expected middle" above is the human-readable name at work. If the "middle" name is removed from the rule, the following error is reported instead:

peg$SyntaxError: Expected "*" but "!" found.

This is one of the strengths of Peg.js: it reports precisely where a match failed and what was expected. The arithmetic grammar above is not the best vehicle for learning more expression types, so let’s look at another example, a grammar for parsing JSON strings:

// JSON Grammar
// ============
//
// Based on the grammar from RFC 7159 [1].
//
// Note that JSON is also specified in ECMA-262 [2], ECMA-404 [3], and on the
// JSON website [4] (somewhat informally). The RFC seems the most authoritative
// source, which is confirmed e.g. by [5].
//
// [1] http://tools.ietf.org/html/rfc7159
// [2] http://www.ecma-international.org/publications/standards/Ecma-262.htm
// [3] http://www.ecma-international.org/publications/standards/Ecma-404.htm
// [4] http://json.org/
// [5] https://www.tbray.org/ongoing/When/201x/2014/03/05/RFC7159-JSON

// ----- 2. JSON Grammar -----

// JSON_text is a value with optional whitespace on either side; the action returns the value directly
// Inside the action, value refers to the label on the left of value:value; the right-hand side is the value rule defined below
JSON_text
  = ws value:value ws { return value; }

begin_array     = ws "[" ws
begin_object    = ws "{" ws
end_array       = ws "]" ws
end_object      = ws "}" ws
name_separator  = ws ":" ws
value_separator = ws "," ws

// ws has a human-readable name, whitespace, which makes error messages more meaningful
ws "whitespace" = [ \t\n\r]*

// ----- 3. Values -----

// The / in the expression is an ordered choice: try false first
// If that fails, try null
// If that fails, try true
// ... and so on, in order
// If string does not match either, the whole rule fails
value
  = false
  / null
  / true
  / object
  / array
  / number
  / string

// If the input is one of the following strings, return the corresponding JavaScript value instead of the string
false = "false" { return false; }
null  = "null"  { return null;  }
true  = "true"  { return true;  }

// ----- 4. Objects -----
// Match an object: first match a {
// Then match the inner members; each member produces a structure like
//   { name: "xx", value: "yy" }
// and several members produce
//   { name: "xx", value: "yy" }, { name: "xx2", value: "yy2" }
// The action then converts them into { "xx": "yy", "xx2": "yy2" }
// The ? that follows tries to match the expression: it returns the match result on success and null otherwise. Unlike in regular expressions, there is no backtracking.
// Finally, match a }
// If members is null, return {}
object
  = begin_object
    members:(
      head:member
      tail:(value_separator m:member { return m; })*
      {
        var result = {};

        [head].concat(tail).forEach(function(element) {
          result[element.name] = element.value;
        });

        return result;
      }
    )?
    end_object
    { return members !== null ? members: {}; }

// Matches one member of an object, for example "name": "remainder"
// That is: a string, then a :, then a value
// The action returns a { name, value } structure
member
  = name:string name_separator value:value {
      return { name: name, value: value };
    }

// ----- 5. Arrays -----
// For example: [1, 2, 3, "a", "b", "c", {"a": 1}]
// Start with a [
// Then match head, which is a value
// Then match zero or more times: a value separator followed by a value
array
  = begin_array
    values:(
      head:value
      tail:(value_separator v:value { return v; })*
      { return [head].concat(tail); }
    )?
    end_array
    { return values !== null ? values : []; }

// ----- 6. Numbers -----
// Match a number:
//   an optional minus sign
//   the integer part
//   an optional fractional part
//   an optional exponent
// The action returns the matched text converted to a number
number "number"
  = minus? int frac? exp? { return parseFloat(text()); }

// Decimal point
decimal_point
  = "."

digit1_9
  = [1-9]

// Exponent marker: e or E
e
  = [eE]

// Exponent
exp
  = e (minus / plus)? DIGIT+

// Fractional part
frac
  = decimal_point DIGIT+

// The integer part is either 0, or a digit 1-9 followed by zero or more digits 0-9
int
  = zero / (digit1_9 DIGIT*)

// Minus sign
minus
  = "-"

// Plus sign
plus
  = "+"

// Match 0
zero
  = "0"

// ----- 7. Strings -----
// Match a string:
//   an opening double quotation mark
//   zero or more characters
//   a closing double quotation mark
// The action joins the matched chars into a single string
string "string"
  = quotation_mark chars:char* quotation_mark { return chars.join(""); }

// Match a character: either an unescaped character,
// or an escape character followed by an escape sequence
char
  = unescaped
  / escape
    sequence:(
        '"'
      / "\\"
      / "/"
      / "b" { return "\b"; }
      / "f" { return "\f"; }
      / "n" { return "\n"; }
      / "r" { return "\r"; }
      / "t" { return "\t"; }
      / "u" digits:$(HEXDIG HEXDIG HEXDIG HEXDIG) {
          return String.fromCharCode(parseInt(digits, 16));
        }
    )
    { return sequence; }

// Escape character
escape
  = "\\"

// Double quotation mark
quotation_mark
  = '"'

// https://regex101.com/r/EAogfy/1
// Unescaped characters
unescaped
  = [^\0-\x1F\x22\x5C]

// ----- Core ABNF Rules -----

// See RFC 4234, Appendix B (http://tools.ietf.org/html/rfc4234).
DIGIT  = [0-9]
// Hexadecimal
HEXDIG = [0-9a-f]i


The grammar above covers more than 80% of the parsing expression types described in the documentation. Let’s go through them from top to bottom:

"literal" | 'literal'

A double- or single-quoted literal matches that exact string, for example:

begin_array     = ws "[" ws

An array begins with [, optionally surrounded by whitespace.

expression1 / expression2 / ... / expressionn

// The / in the expression is an ordered choice: try false first
// If that fails, try null
// If that fails, try true
// ... and so on, in order
// If string does not match either, the whole rule fails
value
  = false
  / null
  / true
  / object
  / array
  / number
  / string

A JSON value is matched by trying the alternatives in the order listed above. ⚠️ The next alternative is tried only if the previous one fails to match.
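
To see why the order of alternatives matters, here is a minimal hand-written sketch of ordered choice in plain JavaScript. The helpers literal and choice are hypothetical names introduced for illustration; they are not part of Peg.js and only imitate how / behaves. A matcher takes the input and a position and returns { value, pos } on success or null on failure.

```javascript
// Matches an exact string at the given position (hypothetical helper).
function literal(str) {
  return function (input, pos) {
    return input.startsWith(str, pos)
      ? { value: str, pos: pos + str.length }
      : null;
  };
}

// Ordered choice: try alternatives left to right, commit to the first success.
function choice(...alternatives) {
  return function (input, pos) {
    for (const alternative of alternatives) {
      const result = alternative(input, pos);
      if (result !== null) return result;
    }
    return null;
  };
}

const shortFirst = choice(literal('a'), literal('ab'));
const longFirst = choice(literal('ab'), literal('a'));

const r1 = shortFirst('ab', 0);
const r2 = longFirst('ab', 0);
const r3 = shortFirst('x', 0);

console.log(r1); // { value: 'a', pos: 1 }
console.log(r2); // { value: 'ab', pos: 2 }
console.log(r3); // null
```

Because the first successful alternative wins, putting 'a' before 'ab' means 'ab' is never chosen. When alternatives share a prefix, the longer one usually needs to come first.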

expression { action }

Actions are used almost everywhere, for example:

false = "false" { return false; }

When the input matches the string "false", the rule returns the Boolean false. The code inside { ... } is JavaScript: it receives the match results and can perform any conversion on them. Four functions can be called inside the action body:

  • text: returns the text matched by the expression;

  • expected: makes the parser throw an exception. It takes two parameters: a description of what was expected at the current position, and optional location information;

  • error: also makes the parser throw an exception. It takes two parameters: an error message, and optional location information;

  • location: returns location information, as in the following object:

{
  start: { offset: 23, line: 5, column: 6 },
  end:   { offset: 25, line: 5, column: 8 }
}

expression1 expression2 ... expressionn

frac
  = decimal_point DIGIT+

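A sequence matches its expressions one after another and returns their results in an array. For example, when '.123' matches the frac rule, the result is ['.', ['1', '2', '3']], because DIGIT+ itself returns an array of the matched digits.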
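
The shape of a sequence result can be checked with a small hand-written sketch in plain JavaScript. The helpers literal, charClass, oneOrMore, and sequence are hypothetical names that imitate the corresponding Peg.js expressions; they are not the library's API. A sequence returns one result per sub-expression, and a repetition such as DIGIT+ contributes its own array of matches.

```javascript
// Matches an exact string (hypothetical helper).
function literal(str) {
  return (input, pos) =>
    input.startsWith(str, pos) ? { value: str, pos: pos + str.length } : null;
}

// Matches one character accepted by the given regular expression.
function charClass(regex) {
  return (input, pos) =>
    pos < input.length && regex.test(input[pos])
      ? { value: input[pos], pos: pos + 1 }
      : null;
}

// expression+ : one or more matches, collected into an array.
function oneOrMore(matcher) {
  return (input, pos) => {
    const values = [];
    let current = pos;
    let result;
    while ((result = matcher(input, current)) !== null) {
      values.push(result.value);
      current = result.pos;
    }
    return values.length > 0 ? { value: values, pos: current } : null;
  };
}

// expression1 expression2 ... : all parts must match, in order.
function sequence(...parts) {
  return (input, pos) => {
    const values = [];
    let current = pos;
    for (const part of parts) {
      const result = part(input, current);
      if (result === null) return null;
      values.push(result.value);
      current = result.pos;
    }
    return { value: values, pos: current };
  };
}

const DIGIT = charClass(/[0-9]/);
const frac = sequence(literal('.'), oneOrMore(DIGIT));

const matched = frac('.123', 0);
const failed = frac('.x', 0);
console.log(JSON.stringify(matched)); // {"value":[".",["1","2","3"]],"pos":4}
console.log(failed);                  // null
```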

label : expression

Labels are also used everywhere: the match result of a labeled expression is available in the action body as a variable with the label's name.

member
  = name:string name_separator value:value {
      // name is the label from name:string
      // In value:value, the first value is the label and the second is the value rule
      return { name: name, value: value };
    }
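
One way to picture labels is to treat each action as a function whose parameters are the labels of its expression. The sketch below is a hand-written approximation under that assumption; the function names memberAction and objectAction are hypothetical, and the bodies are copied from the member and object rules of the JSON grammar.

```javascript
// Action of the member rule: the labels name and value become parameters.
function memberAction(name, value) {
  return { name: name, value: value };
}

// Action inside the object rule: the labels head and tail become parameters.
function objectAction(head, tail) {
  var result = {};

  [head].concat(tail).forEach(function (element) {
    result[element.name] = element.value;
  });

  return result;
}

// Simulated match results for {"xx": "yy", "xx2": "yy2"}.
const head = memberAction('xx', 'yy');
const tail = [memberAction('xx2', 'yy2')];
const obj = objectAction(head, tail);

console.log(obj); // { xx: 'yy', xx2: 'yy2' }
```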

expression ?

Try to match the expression. Returns the result of the match if it was successful, null otherwise. Unlike regular expressions, there is no backtracking.

// ----- 6. Numbers -----
// Match a number:
//   an optional minus sign
//   the integer part
//   an optional fractional part
//   an optional exponent
// The action returns the matched text converted to a number
number "number"
  = minus? int frac? exp? { return parseFloat(text()); }
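
The null result of an unmatched optional expression can be illustrated with a short hand-written sketch. The helpers literal and optional are hypothetical names that imitate the behavior described above; they are not part of Peg.js.

```javascript
// Matches an exact string (hypothetical helper).
function literal(str) {
  return function (input, pos) {
    return input.startsWith(str, pos)
      ? { value: str, pos: pos + str.length }
      : null;
  };
}

// expression ? : never fails; yields null and consumes nothing when there is no match.
function optional(matcher) {
  return function (input, pos) {
    const result = matcher(input, pos);
    return result !== null ? result : { value: null, pos: pos };
  };
}

const minus = optional(literal('-'));

const withSign = minus('-5', 0);
const withoutSign = minus('5', 0);

console.log(withSign);    // { value: '-', pos: 1 }
console.log(withoutSign); // { value: null, pos: 0 }
```

This is why the object and array rules in the JSON grammar compare their optional part against null before returning an empty object or array.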

That completes our walk through the Peg.js expressions used in json.pegjs and their basic usage. Other expressions, such as [characters], (expression), and expression *, will be familiar from regular expressions, so we won’t work through examples of them here.

Conclusion

This article started from what Peg.js is: a parser generator for JavaScript. It then covered how a parser is generated and which options are available, looking more closely at commonly used ones such as --allowed-start-rules and --dependency. Finally, it analyzed the usage of parsing expressions in detail based on json.pegjs.

In short, writing a compiler comes down to three steps:

  • Match the input string against parsing expressions (much like regular-expression matching);
  • Transform the match results;
  • Output the result.
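
The three steps can be sketched end to end in plain JavaScript. This is a toy under stated assumptions: the input format (pairs such as a=1;b=22), the function names parsePairs, transform, and generate, and the choice of output are all hypothetical, and a regular expression stands in for the parsing expressions that Peg.js would provide.

```javascript
// Step 1: match the input string and collect the raw results.
function parsePairs(source) {
  const pairs = [];
  const pattern = /([a-z]+)=([0-9]+);?/g;
  let match;
  while ((match = pattern.exec(source)) !== null) {
    pairs.push({ name: match[1], value: match[2] });
  }
  return pairs;
}

// Step 2: transform the match results (upper-case names, numeric values).
function transform(pairs) {
  return pairs.map(function (pair) {
    return { name: pair.name.toUpperCase(), value: Number(pair.value) };
  });
}

// Step 3: output the result as target code.
function generate(nodes) {
  return nodes
    .map(function (node) {
      return 'const ' + node.name + ' = ' + node.value + ';';
    })
    .join('\n');
}

const output = generate(transform(parsePairs('a=1;b=22')));
console.log(output);
// const A = 1;
// const B = 22;
```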

Peg.js just simplifies the process of doing that. Standing on the shoulders of giants, we’ll implement a compiler of our own in the next article.