1 the introduction

Return to the “Handwritten SQL Editor” series. Previous installments covered morphology, grammar, parsing of syntax, and implementation of backtracking. This time, I will introduce how to generate a syntax tree.

Based on the idea of “backtracking”, we use JS to achieve a miniature SQL parser, and introduce how to generate syntax tree, how to achieve syntax tree generation function in JS SQL engine!

The analytic objectives are:

select name.version from my_table;
Copy the code

Grammar:

const root = (a)= > chain(selectStatement, many(";", selectStatement));

const selectStatement = (a)= > chain("select", selectList, fromClause);

const selectList = (a)= > chain(matchWord, many(",", matchWord));

const fromClause = (a)= > chain("from", matchWord);

const statement = (a)= >
  chain(
    "select",
    selectList,
    "from",
    chain(tableName, [whereStatement, limitStatement])
  );
Copy the code

This is a condensed version of the implementation of this article for convenience. See our open source repository cParser for the full version.

Root is the entry function, and the grammar of the many() package can be executed any time, so

chain(selectStatement, many(";", selectStatement));
Copy the code

Indicates that selectStatements of arbitrary length are allowed. The same thing with selectList.

MatchWord means to match any word.

Syntax trees are artificial abstractions of syntax structures. Essentially, if we stop here, we can generate a basic syntax tree, which is a multi-dimensional array, such as:

const fromClause = (a)= > chain("from", matchWord);
Copy the code

The default syntax tree generated by this grammar is [‘from’, ‘my_table’], but only the current grammar knows what the meaning of from my_table is (the first flag has no meaning, the second flag represents the table name).

The fromClause returns a syntax tree that is passed as a result to the grammar selectStatement. The result can be: [‘select’, [[‘name’, ‘version’]], [‘from’, ‘my_table’]].

You can easily see the problem: when the default syntax tree is clustered together, it is impossible to understand the syntax meaning without the grammar structure. To understand the syntax tree without the grammar structure, we need to abstract it into a rule-based structure.

2. Intensive reading

Through the above analysis, we need to provide the chain function with the ability to modify the local AST structure:

const selectStatement = (a)= >
  chain("select", selectList, fromClause)(ast= > ({
    type: "statement",
    variant: "select",
    result: ast[1].from: ast[2]}));Copy the code

We can modify the default syntax tree with additional parameters, change the multi-dimensional array structure to an object structure, and add the Type Variant attribute to indicate the type and subtype of the current object. For example, in the above example, the returned object tells the user: “I am an expression, a SELECT expression, my result is result, and my source table is from.”

So how does the chain function implement the syntax tree function?

For each grammar (each chain function), its syntax tree must wait for all child elements to execute before it can be generated. So it’s a depth-first operation.

The following figure describes the chain function execution mechanism:

There are four basic structures in the generated structure, namely Chain, Tree, Function, and Match, which are sufficient to express all the logic needed for parsing. (Optional, multiple choice logic is not included).

The syntax tree of the current node is generated only after all the children of each element have been executed. In fact, after each node is executed, callParentNode will be called to access the parent node. If this function is executed, it indicates that the child element has been successfully executed. Complete the AST information of the corresponding node.

To modify a local AST structure, wait until the entire ChainNode is complete before calling it and store the new AST information returned as the final AST information for the node and pass it to the parent (or none, which is the AST result for the root).

3 summary

This article explained how to generate a syntax tree, explained the existence of a default syntax tree, and explained why we wanted a custom syntax tree to make it easier to understand the meaning.

It also shows you how to run a complete set of syntax parsers through JS and how to provide the ability to customize AST structures.

The model described in this article is a simplified version customized for ease of understanding. For full details, visit CParser.

Finally, why do you want to make this syntax parser? There are many open source AST parsing tools available today, but the scenario I’m addressing is syntactic autoprompt, which requires all possible inputs for the current cursor position in the case of incomplete or even incorrect statements. Therefore, the syntax parser kernel can be completely rewritten to automatically recover from errors in common error scenarios while parsing and generating syntax trees at the same time.

At present, the performance is being optimized, and the general SQL grammar is still being improved. At present, it can only be used as a reference for learning, not for production environment.

4 More Discussions

Close reading of Handwritten SQL Compilers – Syntax Trees · Issue #99 · dT-fe /weekly

If you’d like to participate in the discussion, pleaseClick here to, with a new theme every week, released on weekends or Mondays.