1 the introduction
Last time intensive reading “Handwritten SQL compiler – Syntax analysis” said how to use Js functions to achieve syntax analysis, left a backtracking problem, that is, archiving, reading file problems.
We think of the parse tree as a maze, with straight lines and forks. To get out of the maze, we need to save ahead of time when we encounter forks, and then try the next fork when we make a mistake. This feature is called backtracking.
In the last article, we implemented a branch function that rolled back TokenIndex and tried again after the branch failed, but in the function call stack, if the sub-function finished and the stack jumped, we could not find the original stack to execute again.
To describe this problem in more detail, take an example of a fork in the road:
a -> tree() -> c
-> b1 -> b1' -> b2 -> b2'
Copy the code
A -> b1 -> b1′ -> C and a -> B2 -> B2 ‘-> C. When the branch b1 fails, the branch function tree can restore to b2 and retry.
However, if b1 -> b1′ passes, but b1 -> b1′ -> c does not pass, the call stack of the branch function tree has exited after b1′ is executed, and the route b2 -> B2 ‘cannot be tried again.
To solve this problem, we need to manually construct the function execution process through the linked list. This can not only achieve arbitrary backtracking, but also solve the problem of left recursion. Since the function is not executed immediately, we can add some Magic actions before execution, such as switching the execution order! This article focuses on how to call the stack through the linked list constructor and implement backtracking.
2. Intensive reading
Suppose we had a chain function that could represent continuous matches in a simpler way:
const root = (tokens: IToken[], tokenIndex: number) = > match('a', tokens, tokenIndex) && match('b', tokens, tokenIndex) && match('c'Tokens tokenIndexconst root = (chain: IChain) = > chain('a'.'b'.'c')
Copy the code
When branching conditions are encountered, replace the tree function with an array representation:
const root = (tokens: IToken[], tokenIndex: number) = > tree(
line(match('a', tokens, tokenIndex) && match('b', tokens, tokenIndex)),
line(match('c', tokens, tokenIndex) && match('d'Tokens tokenIndex)const root = (chain: IChain) = > chain([
chain('a'.'b'),
chain('c'.'d')])Copy the code
The chain function has two properties:
- Without immediate execution, we can generate the execution chain in advance, optimize the chain structure, and even control the execution sequence to realize the backtracking function.
- No need to display Token transfer, reducing the amount of code written for each step of matching.
Encapsulate scanner and matchToken
We can make the scanner function to encapsulate the token operation:
const query = "select * from table;";
const tokens = new Lexer(query);
const scanner = new Scanner(tokens);
Copy the code
Scanner has two main functions: read reads the current token content, and Next moves the token down one bit. We can encapsulate the new matchToken function based on this function:
function matchToken(
scanner: Scanner,
compare: (token: IToken) => boolean
): IMatch {
const token = scanner.read();
if (!token) {
return false;
}
if (compare(token)) {
scanner.next();
return true;
} else {
return false;
}
}
Copy the code
Returns false and does not consume the token if the token is exhausted, or does not match the match, and true if the token is consumed when matched.
Now we can write a matching code using the matchToken function:
const query = "select * from table;";
const tokens = new Lexer(query);
const scanner = new Scanner(tokens);
const root =
matchToken(scanner, token= > token.value === "select") &&
matchToken(scanner, token= > token.value === "*") &&
matchToken(scanner, token= > token.value === "from") &&
matchToken(scanner, token= > token.value === "table") &&
matchToken(scanner, token= > token.value === ";");
Copy the code
We ultimately want to express this structure:
const root = (chain: IChain) = > chain("select"."*"."from"."table".";");
Copy the code
Since the chain function serves as a clue throughout the process, the scanner function needs to be included in the chain closure and passed internally, so we need to construct the first chain.
Encapsulation createChainNodeFactory
We need the createChainNodeFactory function to pass the scanner in, secretly save it internally, do not display it in external code, and the chain function is a higher-order function that does not execute immediately, thus encapsulating the second-order function:
const createChainNodeFactory = (scanner: Scanner, parentNode? : ChainNode) = > (
...elements: any[]
): ChainNode= > {
// Generate the first node
return firstNode;
};
Copy the code
Two points need to be made:
- The chain function returns the first list node, and the entire list can be accessed through the visiter function.
(... elements: any[]): ChainNode
Is the chain function itself, which takes a series of parameters and classifies functions by type.
With createChainNodeFactory, we can generate the execution entry:
const chainNodeFactory = createChainNodeFactory(scanner);
const firstNode = chainNodeFactory(root); // const root = (chain: IChain) => chain('select', '*', 'from', 'table', '; ')
Copy the code
To support chain(‘select’, ‘*’, ‘FROM ‘, ‘table’, ‘; ‘) syntax, we need to automatically generate a matchToken function as the linked list node when the parameter type is text type, and associate the linked list node through reduce function:
const createChainNodeFactory = (scanner: Scanner, parentNode? : ChainNode) = > (
...elements: any[]
): ChainNode= > {
let firstNode: ChainNode = null;
elements.reduce((prevNode: ChainNode, element) = > {
const node = new ChainNode();
// ... Link node
node.addChild(createChainChildByElement(node, scanner, element));
return node;
}, parentNode);
return firstNode;
};
Copy the code
Use reduce function to link list and node, this step is normal so ignore, through classifying incoming function createChainChildByElement function, if the incoming function is a string, it constructs a matchToken function into the current list of child elements, When the linked list is executed, the matchToken function is executed.
The focus is on the treatment of linked list nodes, first introduce the structure of the list.
Chain table structure
class ChainNode {
public prev: ChainNode;
public next: ChainNode;
public childs: ChainChild[] = [];
}
class ChainChild {
// If type is function, when run it, will expend.
public type: "match" | "chainNode" | "function";
publicnode? : IMatchFn | ChainNode | ChainFunctionNode; }Copy the code
ChainNode is a definition of a linked list node. Some of the definitions relevant to the current article are given here. A bidirectional list is used here, so each node has prev and next attributes pointing to the previous and next nodes respectively, and childs is a node mounted under this list, which can be a matchToken function, a list node, or a function.
The entire list structure might look like this:
node1 <-> node2 <-> node3 <-> node4
|- function2-1
|- matchToken2-1
|- node2-1 <-> node2-2 <-> node2-3
|- matchToken2-2-1
Copy the code
For each node, there is at least one child element. If there are more than one child elements, it indicates that the node is a tree node and there are branches.
The node type ChainChild can also be seen from the definition. There are three types, which we describe separately:
MatchToken type
This type is the most basic type and is generated by the following code:
chain("word");
Copy the code
In linked list execution, match is the basic unit of execution that determines whether a statement will match, and it is the only unit that consumes tokens.
The node type
A child node of a linked list node may also be a node, analogous to a nested function, generated by code like this:
chain(chain("word"));
Copy the code
That is, one element of the chain is the chain itself, and the chain sub-linked list will act as the child element of the parent node. When the execution reaches the linked list node, the depth-first traversal will be carried out. If the execution passes, it will jump to the parent to continue looking for the next node, and its execution mechanism is similar to the entry and exit relationship of function call stack.
Function types
Function types are very special, and we don’t need to recursively expand all function types, because the grammar can be infinitely recursive.
It is like a maze, where many areas are identical and repeated. If the maze is completely unfolded, the size of the maze will reach infinity. So as the computer executes, we have to expand these functions step by step, so that the end of the maze depends on whether the Token is exhausted, the Token is out of the maze, or the Token is not matched. Instead of running out of resources when the maze is generated. The function type node is generated by the following code:
chain(root);
Copy the code
All function type nodes are expanded at the time of execution. If a function node is encountered again during expansion, it is retained until the next execution.
branch
A normal link is a special case of a branch. The following code is equivalent:
chain("a");
chain(["a"]);
Copy the code
Compare the following code:
chain(["a"]);
chain(["a"."b"]);
Copy the code
Either a line or a branch can be regarded as a branch route, while the case of a line (no branch) can be regarded as a branch with only one fork, compared to a list node, corresponding to a list node with only one element childs.
back
The chain function now supports three sub-elements, a branching expression:
chain("a"); // MatchNode
chain(chain("a")); // ChainNode
chain(foo); // FunctionNode
chain(["a"]); // Branch -> [MatchNode]
Copy the code
As mentioned above, the chain function does not execute immediately, so when we execute this code, we just generate the list structure and do not actually execute the content, which is contained in the childs.
We need to construct execChain, take the first node of the list and traverse the list nodes through the visiter function to actually execute.
function visiter(chainNode: ChainNode, scanner: Scanner, treeChances: ITreeChance[]) :boolean {
const currentTokenIndex = scanner.getIndex();
if(! chainNode) {return false;
}
const nodeResult = chainNode.run();
let nestedMatch = nodeResult.match;
if (nodeResult.match && nodeResult.nextNode) {
nestedMatch = visiter(nodeResult.nextNode, scanner, treeChances);
}
if (nestedMatch) {
if(! chainNode.isFinished) {// It's a new chance, because child match is true, so we can visit next node, but current node is not finished, so if finally falsely, we can go back here.
treeChances.push({
chainNode,
tokenIndex: currentTokenIndex
});
}
if (chainNode.next) {
return visiter(chainNode.next, scanner, treeChances);
} else {
return true; }}else {
if (chainNode.isFinished) {
// Game over, back to root chain.
return false;
} else {
// Try again
scanner.setIndex(currentTokenIndex);
returnvisiter(chainNode, scanner, treeChances); }}}Copy the code
In the code above, nestedMatch is analogous to nested functions, and treeChances is the key to backtracking.
The current node fails to execute
Since each node contains N children, any time the execution fails, the child of the node is marked and the child of the current node is determined to be tried. False is returned only when all nodes fail.
The location is archived when the current node is successfully executed
When the node succeeds, the current execution position needs to be recorded to prevent subsequent link execution failures, that is, treeChances are used to save a saving point.
However, we do not know when the whole linked list will fail, so we have to wait for the completion of the whole Visiter to know whether the execution failed, so we need to determine whether there are treeChances at the end of each execution:
while(! result && treeChances.length >0) {
const newChance = treeChances.pop();
scanner.setIndex(newChance.tokenIndex);
result = judgeChainResult(
visiter(newChance.chainNode, scanner, treeChances),
scanner
);
}
Copy the code
At the same time, we need to add a new field tokenIndex to the linked list structure for backtracking, and call the setIndex method of scanner function to restore the token position.
Finally, if the opportunity is exhausted, the match fails, as long as there is any chance, or one life to complete the game, the match is successful.
3 summary
In this article, we use linked lists to rewrite the function execution mechanism, not only to make the matching function have backtracking capability, but also to make its expression more intuitive:
chain("a");
Copy the code
This construction is essentially the same as compiling code from grammar structure, except that many lexical parsers parse text into code, and we use code to express grammar structure, and the result of executing itself is “compiled code.”
Next time we’ll explore how to solve left recursion problems automatically, allowing us to write expressions like this:
const foo = (chain: IChain) = > chain(foo, bar);
Copy the code
Fortunately, the chain function does not execute immediately, so we will not immediately fall into the vortex of stack overflow, but during the execution of the node, the function will expand indefinitely and the stack overflow will occur.
Solving left recursion is not easy. Is there any alternative to rewriting grammars manually or automatically? Welcome to comment.
4 More Discussions
The discussion address is: Close reading handwritten SQL Compilers – Backtracking · Issue #96 · dT-fe /weekly
If you’d like to participate in the discussion, pleaseClick here to, with a new theme every week, released on weekends or Mondays.