1. The introduction

Syntax-parser is a JS version of syntax parser generator, with the ability of word segmentation and syntax tree parsing.

Two examples are given to illustrate its functions.

The first example is to create a lexical parser, myLexer:

import { createLexer } from "syntax-parser";

const myLexer = createLexer([
  {
    type: "whitespace",
    regexes: [/^(\s+)/],
    ignore: true
  },
  {
    type: "word",
    regexes: [/^([a-zA-Z0-9]+)/] {},type: "operator",
    regexes: [/ / ^ (\ +)]}]);Copy the code

As shown above, “Spaces”, “letters or numbers”, and “plus signs” are matched by the re, and the matched Spaces are ignored (not printed).

Word segmentation matches from left to right, taking precedence over the first item in the array, and so on.

Next use myLexer:

const tokens = myLexer("a + b");

// tokens:
/ / /
// { "type": "word", "value": "a", "position": [0, 1] },
// { "type": "operator", "value": "+", "position": [2, 3] },
// { "type": "word", "value": "b", "position": [4, 5] },
// ]
Copy the code

‘A + b’ is divided into arrays of the “three types” defined above, each of which contains the original value and its location.

The second example is to create a syntax parser myParser:

import { createParser, chain, matchTokenType, many } from "syntax-parser";

const root = (a)= > chain(addExpr)(ast= > ast[0]);

const addExpr = (a)= >
  chain(matchTokenType("word"), many(addPlus))(ast= > ({
    left: ast[0].value,
    operator: ast[1] && ast[1] [0].operator,
    right: ast[1] && ast[1] [0].term
  }));

const addPlus = (a)= >
  chain("+"), root)(ast= > ({
    operator: ast[0].value,
    term: ast[1]}));const myParser = createParser(
  root, // Root grammar.
  myLexer // Created in lexer example.
);
Copy the code

Use chain function to write grammar expressions: by matching literals (such as + sign), and matchTokenType to fuzzy match the “three types” we parse above, a complete grammar expression is formed.

Syntax-parser also provides several other useful functions, such as many optional to match multiple times and zero or once, respectively.

Next use myParser:

const ast = myParser("a + b");

// ast:
/ / [{
// "left": "a",
// "operator": "+",
// "right": {
// "left": "b",
// "operator": null,
// "right": null
/ /}
// }]
Copy the code

2. The intensive reading

According to the following outline of ideas for source interpretation:

  • lexing
    • Vocabulary and Concepts
    • Word segmentation is
  • Syntax parsing
    • Vocabulary and Concepts
    • Reworking the JS execution Engine
    • Implementing the Chain function
    • The engine performs
    • When the execution is complete
    • Implementation of or logic
    • Implementation of many, optional, plus
    • Error message & Enter recommendations
    • The First set to optimize

lexing

Lexical parsing is a bit like NLP word segmentation, but when it is simpler than words, the logic of lexical parsing is unambiguous and is generally expressed in regular fragments.

Vocabulary and Concepts

  • Lexer: A lexical parser.
  • Token: morpheme after participle, includingValue: the value,Position of the position:,Type: type.

Word segmentation is

The createLexer function takes an array of re’s, so the idea is to go through the array of numbers, matching string by string.

We need these functions:

class Tokenizer {
  public tokenize(input: string) {
    // Call getNextToken to perform a regular match on the input string. After the match, SubString trims out the part that just matched, and then matches again until the string trims out
  }

  private getNextToken(input: string) {
    // Call getTokenOnFirstMatch to iterate over the regex match on the input string input, and return any matches immediately
  }

  private getTokenOnFirstMatch({
    input,
    type,
    regex
  }: {
    input: string;
    type: string;
    regex: RegExp; {})// Matches the input string input with the regular regex and returns the basic structure of the Token object}}Copy the code

Tokenize is the entry function that loops through getNextToken to match the Token and crop the string until the string is cropped.

Syntax parsing

Syntax parsing is based on lexical parsing. The input is Tokens and the Tokens are matched according to the grammar rules. When the Tokens are matched and fully conform to the grammar rules, the grammar tree is created.

A lexical parser generator is a “tool for generating lexical parsers”; as long as you enter the specified grammar description, the internal engine automatically does the rest.

The difficulty with this generator is that when the matching or logic fails, the call stack needs to be restored to where it was before the failure, whereas in the JS engine the call stack is not controlled by the code, so the code needs to be executed in the simulation engine.

Vocabulary and Concepts

  • Parser: Syntax Parser.
  • ChainNode: Continuous matching, performing one of four nodes in a chain.
  • TreeNode: Matches one of the four nodes in the chain.
  • FunctionNode: a FunctionNode, one of four nodes in the chain of execution.
  • MatchNode: Matches literals or tokens of a certain type and performs one of four nodes in the chain. Each correct Match consumes a Token.

Reworking the JS execution Engine

Why redo a JS execution engine? Look at the following code:

const main = () =>
  chain(functionA(), tree(functionB1(), functionB2()), functionC());

const functionA = () => chain("a");
const functionB1 = () => chain("b", "x");
const functionB2 = () => chain("b", "y");
const functionC = () => chain("c");
Copy the code

Assume that chain(‘a’) can match Token A and chain(functionC) can match Token C.

When the input is a, B, y, and c, how do we write the tree function?

We expect the match to fail on functionB1 and try again on functionB2 until one of them succeeds.

So the tree function might look like this:

function tree(. funs) {
  / /... Store current tokens
  for (const fun of funs) {
    / /... Reset current tokens
    const result = fun();
    if (result === true) {
      returnresult; }}}Copy the code

Keep trying the contents of the tree until you can match the result correctly and return the result. Since a correct match consumes Tokens, the current Tokens content needs to be stored before and after execution, and Tokens need to be restored and new execution links tried if execution fails.

It looks easy, doesn’t it?

However, here’s an example that breaks this nice assumption, so let’s change the values a bit:

const main = () =>
  chain(functionA(), tree(functionB1(), functionB2()), functionC());

const functionA = () => chain("a");
const functionB1 = () => chain("b", "y");
const functionB2 = () => chain("b");
const functionC = () => chain("y", "c");
Copy the code

The input is still a, B, Y, and c, and what happens?

The line functionA -> functionB1 is a, b, and y, but the result is a, B, y, and C, which does not match the input.

The correct line is functionA -> functionB2 -> functionC, and the result is a, b, y, c!

FunctionA -> functionB1 -> functionC link, when executing to functionC, it is found that the match is wrong, and there is no door to return to functionB2 at this time! Because the execution stack for tree(functionB1(), functionB2()) has exited and can never be found again.

So you need to simulate an execution engine that, at a fork in the road, willfunctionB2Save it and you can go back to this node and re-execute it at any time.

Implementing the Chain function

The best choice is to use a list to design the Chain function. We will simulate the JS call stack.

const main = () => chain(functionA, [functionB1, functionB2], functionC)();

const functionA = () => chain("a")();
const functionB1 = () => chain("b", "y")();
const functionB2 = () => chain("b")();
const functionC = () => chain("y", "c")();
Copy the code

The above example has only one minor change, that is, the function does not execute immediately.

Chain converts a function to a FunctionNode, a literal a or b to a MatchNode, [] to a TreeNode, and itself to a ChainNode.

We get the following linked list:

ChainNode (main) └ ─ ─ FunctionNode (functionA) ─ TreeNode ─ FunctionNode(functionC) │ ─ ─ FunctionNode (function(B1) └ ─ ─ FunctionNodefunctionB2)
Copy the code

As for why FunctionNode doesn’t expand directly to MatchNode, consider the description const list = () => chain(‘,’, list). In fact, the number of Tokens is always limited. If you expand them again, they will always match all Tokens.

Then we need a function to convert the different parameters received by the chain function into the corresponding Node:

const createNodeByElement = (
  element: IElement,
  parentNode: ParentNode,
  parentIndex: number,
  parser: Parser
): Node= > {
  if (element instanceof Array) {
    // ... return TreeNode
  } else if (typeof element === "string") {
    // ... return MatchNode
  } else if (typeof element === "boolean") {
    / /... True indicates that the match will succeed. False indicates that the match will fail. The Token is not consumed
  } else if (typeof element === "function") {
    // ... return FunctionNode}};Copy the code

createNodeByElementFunction source code

The engine performs

Engine execution is essentially accessing a linked list, with the visit function being the best means.

const visit = tailCallOptimize(
  ({
    node,
    store,
    visiterOption,
    childIndex
  }: {
    node: Node;
    store: VisiterStore;
    visiterOption: VisiterOption;
    childIndex: number; = > {})if (node instanceof ChainNode) {
      // Call 'visitChildNode' to access child nodes
    } else if (node instanceof TreeNode) {
      // Call 'visitChildNode' to access child nodes
      visitChildNode({ node, store, visiterOption, childIndex });
    } else if (node instanceof MatchNode) {
      // Match the current Token. If the match is successful, call 'visitNextNodeFromParent' to access the next parent Node. If the match fails, call 'tryChances', as explained in the or logic.
    } else if (node instanceof FunctionNode) {
      // Execute the function node and replace the current node with 'visit' again}});Copy the code

Since the visit function can be executed millions of times at most, tailCallOptimize is used for tail-recursive optimization to prevent memory or stack overflows.

The visit function is only responsible for accessing the node itself, while the visitChildNode function is responsible for accessing the node’s children (if any), and the visitNextNodeFromParent function is responsible for finding the next child of the parent node to visit if there are no children.

function visitChildNode({
  node,
  store,
  visiterOption,
  childIndex
}: {
  node: ParentNode;
  store: VisiterStore;
  visiterOption: VisiterOption;
  childIndex: number;
}) {
  if (node instanceof ChainNode) {
    const child = node.childs[childIndex];
    if (child) {
      Call the 'visit' function to access the child node 'child'
    } else {
      // If there are no children, call 'visitNextNodeFromParent' to find them}}else {
    // For TreeNode, if the last node is not accessed, add an "archive"
    / / call ` addChances `
    // Also, if there is a child element, the child element 'visit'}}const visitNextNodeFromParent = tailCallOptimize(
  (
    node: Node,
    store: VisiterStore,
    visiterOption: VisiterOption,
    astValue: any) = > {if(! node.parentNode) {// The function that finds the parent node has no parent, which is described below. Remember that this position is called the END bit.
    }

    if (node.parentNode instanceof ChainNode) {
      // A B <- next node C
      / / └ ─ ─ node < - the current node
      // As shown in the figure, find the nextNode and call 'visit'
    } else if (node.parentNode instanceof TreeNode) {
      // The TreeNode node is skipped directly with 'visitNextNodeFromParent'. Because only one branch of the TreeNode node is active at a time, it has no children}});Copy the code

As you can see, the visitChildNode and visitNextNodeFromParent functions take care of themselves and leave the rest of the work to other functions, which makes the code easier to understand.

With vist visitChildNode and visitNextNodeFromParent, access to nodes, child nodes, and, if there are no child nodes, trace back to the upper node.

visitFunction source code

When the execution is complete

When the visitNextNodeFromParent function calls the END bit, it’s time to finish:

  • When the Tokens just run out, the perfect match succeeds.
  • Tokens didn’t run out, failed to match.
  • There’s another failureChanceWhen you use up light, say it together with the or logic below.

Implementation of or logic

The “or” logic was the reason for refactoring the JS engine, and this problem is now well resolved.

const main = (a)= > chain(functionA[functionB1.functionB2].functionC) ();
Copy the code

The code above, for example, is treated as “or” logic when encountering an array structure [], with the child elements stored in the TreeNode node.

In the visitChildNode function, unlike ChainNode, when a TreeNode child is accessed, the addChances method is also called to store the execution state for the next child so that it can be resumed in the future.

AddChances maintains a pool where calls are first in and last out:

function addChances(/ *... * /) {
  const chance = {
    node,
    tokenIndex,
    childIndex
  };

  store.restChances.push(chance);
}
Copy the code

The opposite of addChance is tryChance.

TryChances is called in two cases:

  • MatchNodeThe match failed. Node matching failure is the most common failure, but ifchancesThe pool has an archive so you can go back and try again.
  • There is no next node, but the Tokens have not run outtryChancesKeep trying.

Let’s take a look at the magic save recovery function tryChances:

function tryChances(node: Node, store: VisiterStore, visiterOption: VisiterOption) {
  if (store.restChances.length === 0) {
    // Direct failure
  }

  const nextChance = store.restChances.pop();

  // reset scanner index
  store.scanner.setIndex(nextChance.tokenIndex);

  visit({
    node: nextChance.node,
    store,
    visiterOption,
    childIndex: nextChance.childIndex
  });
}
Copy the code

TryChances is very simple. In addition to failing without chances, finding the nearest chance node, restoring the Token pointer position, and visiting that node is equivalent to reading files.

addChanceThe source code

tryChancesThe source code

Implementation of many, optional, plus

These three methods are also nicely implemented.

Look at the optional function first.

export const optional = (. elements: IElements) = > {
  return chain([chain(...elements)(/ * * /)), true]) (/ * * /);
};
Copy the code

As you can see, the optional argument is actually a TreeNode:

chain(optional("a"()));/ / equivalent to the
chain(["a".true) ();Copy the code

Why is that? Because when ‘a’ fails, true is a match that will succeed without consuming the Token, the overall meaning is’ optional ‘.

To further explain, if ‘a’ does not match, then true must match, and matching true equals nothing, which means the expression does not exist.

Look again at functions plus that match one or more:

export const plus = (. elements: IElements) = > {
  const plusFunction = (a)= >chain(chain(... elements)(/ * * /), optional(plusFunction))(/ * * /);
  return plusFunction;
};
Copy the code

Can you tell? The plus function is equivalent to a new recursive function. That is:

const aPlus = (a)= > chain(plus("a"()));/ / equivalent to the
const aPlus = (a)= > chain(plusFunc)();
const plusFunc = (a)= > chain("a", optional(plusFunc))();
Copy the code

By recursively matching as many elements as possible, the optional option for each layer ensures that if any layer fails to match, it can jump to the next grammar without failure.

Finally, look at functions that match many:

export const many = (. elements: IElements) = > {
  returnoptional(plus(... elements)); };Copy the code

“Many” is optional Plus, right?

These three magic functions all take advantage of existing functionality, so it is advisable to leave a minute or so for each function to think about why.

optional plus manyFunction source code

Error message & Enter recommendations

Error prompts are similar to input recommendations in that they are expected input given an error location or cursor position.

Input recommendation is the function of giving the expected content behind the cursor given the string and cursor position.

First, find the last Token of the cursor by the cursor position, and then find all matchnodes that can be matched by the Token by findNextMatchNodes. This is the recommended result.

So how do you implement findNextMatchNodes? See the following:

function findNextMatchNodes(node: Node, parser: Parser) :MatchNode[] {
  const nextMatchNodes: MatchNode[] = [];

  let passCurrentNode = false;

  const visiterOption: VisiterOption = {
    onMatchNode: (matchNode, store, currentVisiterOption) = > {
      if (matchNode === node && passCurrentNode === false) {
        passCurrentNode = true;
        // Call visitNextNodeFromParent and ignore itself
      } else {
        // The MatchNode traversed
        nextMatchNodes.push(matchNode);
      }

      TryChances allows you to find all possible MatchnodestryChances(matchNode, store, currentVisiterOption); }}; newVisit({ node, scanner:new Scanner([]), visiterOption, parser });

  return nextMatchNodes;
}
Copy the code

The so-called finding of subsequent nodes is to find all Matchnodes through Visit, and Matchnodes only need to match once, because we only need to find matchnodes of the first tier.

By executing tryChances after each match, all Matchnodes will be found!

Look at the error prompt, we want to record the last error position, and then use the input recommendation.

But the cursor is at the desired input point, which should also be involved in generating the syntax tree, and the error prompt does not contain the cursor, so we perform the visit twice.

Here’s an example:

select | from b;
Copy the code

| is the cursor position, the statement content is to select the from b; Obviously wrong, but the cursor position should give a hint, and giving a hint requires parsing the syntax tree correctly, so for the hint function, we need to take the cursor position into account. So there are two parses.

findNextMatchNodesFunction source code

The First set to optimize

Building the First set is a bottom-up process. When a MatchNode is accessed, its value is a First value of its parent node. When the First set of the parent node is collected, the judgment of collection of the First set of its parent node will be triggered. The last node to complete the First collection is the top-level node.

For reasons of length, which I won’t go into again, you can look at this picture.

generateFirstSetFunction source code

3. Summary

This article is a summary of the Handwritten SQL Compiler series, from the source point of view!

Each article in the series presents technical details in graphic form and can be read as a supplement:

  • Intensive reading of Handwritten SQL Compilers – Lexical Analysis
  • Intensive reading of Handwritten SQL Compilers – Introduction to Grammar
  • Intensive reading of Handwritten SQL Compilers – Syntax Analysis
  • Close reading of Handwritten SQL Compilers – Backtracking
  • Close reading of “Handwritten SQL Compiler – Syntax Tree”
  • Close reading of “Handwritten SQL Compiler – Error Warning”
  • Close reading of Handwritten SQL Compilers – Caching for Performance Optimization
  • Close reading “Handwriting SQL Compiler – Smart Tips”

Syntax-parser source code · Issue #133 · dt-fe/weekly

If you’d like to participate in the discussion, pleaseClick here to, with a new theme every week, released on weekends or Mondays. Front end Intensive Reading – Helps you filter the right content.