For more articles, visit my blog at www.weeklybigbang.com/
preface
Recently, some students have encountered the need to write a domain-specific language (DSL) and its parser while making some efficiency tools. I happen to have relevant experience to point you to the north.
First of all, have you thought about how to do this function?
This article will focus on how to implement expressions like Excel =C1+C2+”123″ this example, without the relevant knowledge of compilation principles, using regular expressions as an analogy, with the help of a tool library, describe the general steps to implement a dome-specific language parser. Allows you to quickly implement a parser. At the same time, the article will give a similar MySQL Where expression into MongoDB query examples to enrich the application here.
Re and its limits
In daily work, you often encounter pattern matching problems, for example, you can need to extract the area code and the phone number from the phone number format 0755-8771032, and then save; You may need to determine whether an email address such as [email protected] is valid; Or you may need to implement something similar to Excel, such as user input =C1+C2+”123″, where you concatenate the content of C1 with the content of C2 and the string “123”.
The usual way to do this is to use regular expressions. In Python’s case, the API provided by the system can be seen as a three-step process:
import re
pattern = "^ ([0-9]) - ([0-9] +) $" // 1. write regex string of phone number
prog = re.compile(pattern) // 2. compile regex to a matcher
result = prog.match(string) // 3. match string and handle results
Copy the code
So far, the regular expressions look fine. Take the =C1+C2+”123″ requirement as an example. You might think we’d just split it up by the operators (+, -, *, etc.) and then evaluate, but consider the following three cases:
- Operators have precedence, for example
=C1+C2*C3
和=C1*C2+C3
, need to calculate first*
To calculate+
. - Strings have operators inside them, for example
=C1+C2+"=C1+C2"
. - Operations have left and right parentheses matching to change the precedence of operations, for example
=(C1+C2)*C3
This is where using regular expressions alone can be tricky.
General practice
The common practice is to define domain-specific syntax, formalize the syntax (like writing regular expressions), implement a Parser that converts the code into an abstract syntax tree (AST), and parse and run the abstract syntax tree.
The whole process sounds complicated, but it takes a lot of time to build an Parser from scratch. Is there a way to generate an abstract syntax tree from the Parser as we would from a regular Parser? The answer is yes, for example, C language has Bison framework, JS choice is more, you can choose Jison, Parsimmon, PEg. JS, Nearley, etc., this article is based on the number of users of Nearley framework.
How to write a parser
Similar to writing regs, the process of using a Parser generator such as Nearley is divided into three steps.
1. Use BNF to represent your DSL syntax
BNF, or Backus — Naur form, is a way of expressing context-free grammar. The syntax Nearley is based on the Extended Backus — Naur Form (EBNF) of BNF. Here’s the Nearley syntax file I wrote for this Excel expression (for ease of understanding, only operator precedence is implemented here, not parentheses) :
grammar.ne
@builtin "number.ne"
@builtin "whitespace.ne"
@builtin "string.ne"@ {%function buildAssignmentExpression(d) {
return {
type: "AssignmentExpression", op: d[2], left: d[0], right: d[4] }; %}}# Assignment
Exp -> Assignment {% id %}
| Value {% id %}
Assignment -> "=" _ Expression {% d => {
return {
type: "Assignment",
value: d[2]
}
} %}
# Expression
Expression -> AddSubExpression {% id %}
# Expression for Add Sub
AddSubExpression -> AddSubExpression _ "+" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| AddSubExpression _ "-" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| MulDivExpression {% id %}
# Expression for Mul Div
MulDivExpression -> Identifier _ "*" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| Identifier _ "/" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| Value _ "*" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| Value _ "/" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| Value {% id %}
| Identifier {% id %}
# Cell Identifier
Identifier -> [A-Z]:+ [0-9]:+ {%
function(d) {
return {
'type': "AssignmentIdentifier".'column': d[0].join(""),
'line': d[1].join(""%)}}}# Values
Value -> _value {%
function(d) {
return {
'type': "Value".'value': d[0]
};
}
%}
_value -> int {% id %}
| unsigned_decimal {% id %}
| decimal {% id %}
| dqstring {% id %}
| sqstring {% id %}
| btstring {% id %}
Copy the code
1.1 Introduction of syntax modules
Let’s walk through the contents of this file step by step, starting with the header code:
@builtin "number.ne"
@builtin "whitespace.ne"
@builtin "string.ne"
Copy the code
Nearley defines some common syntax. This code introduces Nearley’s predefined number, space, and string syntax. Once introduced, the generated Parser can recognize strings like “123” and numbers like 123.
The built-in syntax module for Nearley can be viewed here.
1.2 Helper variables and functions
Here’s the code:
@ {%function buildExpression(d) {
return {
type: "Expression", op: d[2], left: d[0], right: d[4] }; %}}Copy the code
In Nearley, {% raw %}@{%… %}{% endRAW %} is a global declaration of variables that can be used in the productive Post Processor. And what is called production is going to come right after that.
1.3 Writing production
Let’s take one of the more complicated production expressions:
MulDivExpression -> Identifier _ "*" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| Identifier _ "/" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| Value _ "*" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| Value _ "/" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
| Value {% id %}
| Identifier {% id %}
Copy the code
Nearley | inside this operator is a syntactic sugar, the production could be expressed as multiple production:
MulDivExpression -> Identifier _ "*" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
MulDivExpression -> Identifier _ "/" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
MulDivExpression -> Value _ "*" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
MulDivExpression -> Value _ "/" _ MulDivExpression {% d => buildAssignmentExpression(d) %}
MulDivExpression -> Value {% id %}
MulDivExpression -> Identifier {% id %}
Copy the code
Before introducing each production, let’s introduce two concepts:
- symbol: represents a part of the code, such as the if statement
if (...) {... }
The whole thing can be viewed as a symbol, a string"123"
You can view it as a symbol, a symbol is a recursive concept, a symbol can contain other symbols. For example,if (...) { a = "123" }
The “if” symbol contains the string symbol"123"
. - terminatorWhen a symbol contains no other symbol, it is a terminal. Like string symbols
"123"
In the1
This is a terminal, because it can’t subdivide any other symbols.
Specifically, each production formula can be divided into three parts:
->
To the left of is a non-terminal symbol, which represents the concept of a parent. It can contain multiple symbols or terminals.->
The content on the right-hand side is an expansion expression of the left-hand symbol, which represents how the symbol can be expanded. It can contain multiple symbols or terminators.- The last part is the Nearley Post Processor, which will execute after this production is applied. It is also a piece of JS code that can use the Helper variables and functions we defined earlier. Its run result will be the run result of the whole production.
As you can already see, regular expressions can also be written using BNF. In fact, regular expressions are also context-free, so it is natural to use BNF.
2. The generated Parser
Generating the Parser uses the Nearley framework we introduced earlier. First we save the BNF syntax definitions given above to the file grammar.ne.
- Let’s run it
npm install --save nearley
To install the Nearley dependency for the project and run itnpm install -g nearley
To install global dependencies for the nearley-related commands. - run
nearleyc grammar.ne -o grammar.js
Generate Parser filesgrammar.js
. - To parse the DSL code, run the following:
const nearley = require("nearley");
const grammar = require("./grammar.js");
// Create a Parser object from our grammar.
const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
// Parse something!
parser.feed("=C1+C2*C3");
// parser.results is an array of possible parsings.
console.log(parser.results);
Copy the code
3. Parse the Parser result
After step 2 is completed, we can get the abstract syntax tree corresponding to the DSL code. The abstract syntax tree is actually a JSON object, for example, =C1+D1*E1. The structure of the CORRESPONDING JSON object is shown in the figure below
So the next step is how to parse the tree structure of the object, and then get its corresponding result. Here we use the simplest self-loop parser to evaluate the tree. The principle of the self-loop parser is very simple. We evaluate the AST tree from the bottom up. The whole process is completed by deep traversal of the tree.
To evaluate, we define some atomic operations on non-leaf nodes of the tree:
Identifier
: Get the corresponding column in Excel and treat it asIdentifier
The value of the node is returned.Expression
: evaluates the left and right operators of the Expression node according to the operator, such as aExpression
The operator of*
, it willExpression
Multiply the left and the right of theta.Assignment
Will:Assignment
Under theExpression
Is returned as the statement.
With the above atomic operations, we can start our evaluation. After the initial depth traversal of the Identifier corresponding to D1 and E1, we can replace the value of Identifier according to the above atomic operations. Assume that D1 and E1 correspond to 11 and 12 respectively. Then, after the first recursive evaluation, the tree becomes:
The recursion of the next layer evaluates the Identifier and Expression nodes of the second layer. Based on the above atomic operation, assuming that C1 corresponds to 33, the tree becomes:
And so on, we get the final value of the tree 33 + 132 = 165.
Below is the code that implements recursion and the corresponding AST. For some students, it may be easier to understand by looking at the code directly:
ast.json
{
"type": "Assignment"."value": {
"type": "AssignmentExpression"."op": "+"."left": {
"type": "AssignmentIdentifier"."column": "C"."line": "1"
},
"right": {
"type": "AssignmentExpression"."op": "*"."left": {
"type": "AssignmentIdentifier"."column": "D"."line": "1"
},
"right": {
"type": "AssignmentIdentifier"."column": "E"."line": "1"}}}}Copy the code
eval.js
function evalAst(exp, rows) {
if (exp.type == "Assignment") {
return evalAst(exp.value, rows);
}
if (exp.type == "Value") {
return exp.value;
}
if (exp.type == "AssignmentIdentifier") {
return rows[exp.line][exp.column];
}
if (exp.type == "AssignmentExpression") {
switch(exp.op) {
case "+":
return evalAst(exp.left, rows) + evalAst(exp.right, rows);
case "-":
return evalAst(exp.left, rows) - evalAst(exp.right, rows);
case "*":
return evalAst(exp.left, rows) * evalAst(exp.right, rows);
case "/":
return evalAst(exp.left, rows) / evalAst(exp.right, rows);
default:
throw new Error("invalid operator");
break; }}throw new Error("invalid expression type");
}
Copy the code
The final DEMO can be viewed here:
- Code: stackblitz.com/edit/react-…
- DEMO: react – excel – example. Stackblitz. IO /
Another example
To get a better understanding, here’s another requirement that converts MySQL to where filtering in the cloud, with BNF syntax and Eval JS code:
grammar.ne
@builtin "number.ne"
@builtin "whitespace.ne"
@builtin "string.ne"@ {%function buildExpression(d) {
return {
type: "Expression", op: d[2], left: d[0], right: d[4] }; %}}# exp
Exp -> Binop {% id %}
Binop -> ExpOr {% id %}
ExpOr -> ExpOr __ "or" __ ExpAnd {% d => buildExpression(d) %}
| ExpAnd {% id %}
ExpAnd -> ExpAnd __ "and" __ ExpComparison {% d => buildExpression(d) %}
| ExpComparison {% id %}
ExpComparison ->
Name _ "<" _ Value {% d => buildExpression(d) %}
| Name _ ">" _ Value {% d => buildExpression(d) %}
| Name _ "< =" _ Value {% d => buildExpression(d) %}
| Name _ "> =" _ Value {% d => buildExpression(d) %}
| Name _ "~ =" _ Value {% d => buildExpression(d) %}
| Name _ "= =" _ Value {% d => buildExpression(d) %}
# variables
Name -> _name {%
function(d) {
return {
'type': "Identifier".'name': d[0]
};
}
%}
_name -> [a-zA-Z_] {% id %}
| _name [\w_] {% function(d) {returnd[0] + d[1]; %}}# values
Value -> _value {%
function(d) {
return {
'type': "Value".'value': d[0]
};
}
%}
_value -> int {% id %}
| unsigned_decimal {% id %}
| decimal {% id %}
| dqstring {% id %}
| sqstring {% id %}
| btstring {% id %}
Copy the code
eval.ts
type Expression = {
type: "Expression",
op: "or" | "and" | "<" | ">" | "> =" | "< =" | "~ =" | "= =",
left: Expression | Identifier,
right: Expression | Value
}
type Identifier = {
type: "Identifier",
name: string
}
type Value = {
type: "Value",
value: number | string
}
export default function __eval(ast: Expression, _: any) {
function evalExpression(expression: Expression) {
switch (expression.op) {
case "or":
return _.or([
evalExpression((expression as any).left),
evalExpression((expression as any).right),
]);
break;
case "and":
return _.and([
evalExpression((expression as any).left),
evalExpression((expression as any).right),
]);
break;
case "<":
return {
[(expression as any).left.name]: _.lt((expression as any).right.value)
}
break;
case ">":
return {
[(expression as any).left.name]: _.gt((expression as any).right.value)
}
break;
case "> =":
return {
[(expression as any).left.name]: _.gte((expression as any).right.value)
}
break;
case "< =":
return {
[(expression as any).left.name]: _.lte((expression as any).right.value)
}
break;
case "~ =":
return {
[(expression as any).left.name]: _.neq((expression as any).right.value)
}
break;
case "= =":
return {
[(expression as any).left.name]: _.eq((expression as any).right.value)
}
break;
default:
throw new Error("invalid expression");
break; }}return evalExpression(ast)
}
Copy the code
conclusion
At this point you should be able to write your own DSL and parser. Learning to write your own DSL and parser actually has other benefits. For example, it gives you a better understanding of what a configuration system is. ) {… } else { … } the config (…). If you’re implementing a drag-and-drop UI generator, you’re creating an AST tree, and then implementing a parser that parses the AST to render the results. And conversely, you can think about what kind of capabilities your configuration system can achieve, and the upper limit is that it can achieve the same functionality as writing code, but I don’t recommend this, because some of the solutions in the industry like Blockly or flowcharts to express logic are not very good experience. At the same time, these systems require users to write as much code as they do.