Click “CSDN” at the top and select “Top public account”
Critical moment, the first time to arrive!
By Minko Gechev
The night breeze is blowing
Translator’s Note: Even for professional programmers, building a compiler can be a challenging task. This article will guide you through the process.
I’ve already written several articles related to programming language development, which I’m very excited about! For example, in “Static Code Analysis in Angular 2 and TypeScript Projects” [1], I explored the basics of the compiler front end, explaining the phases of lexical analysis, syntax analysis, and abstract syntax trees.
I recently published “Developing statically typed Programming Languages [2].” This article presents a simple, statically typed functional programming language inspired by lambda calculus. I outsourced the front end of compiler development to the parser generator and focused on the back end of the module for type checking, as well as the module for code generation.
Why do I need this?
You might be thinking “Why do I need to know how to develop a compiler?” For several important reasons:
-
You will have a better understanding of how the programming language you are using works. This will enable you to develop more efficient programs with the language.
-
Often, you will have to reuse the modules described below for different purposes (for example, parsing configuration files, parsing network messages, and so on).
-
Create a DSL. It can be very convenient to create a domain-specific language in your project to simplify tasks that would otherwise take more time to solve with a general-purpose programming language.
What is the scope of our discussion?
In this blog post, we’ll cover the basics from end-to-end. We will develop a very simple compiler with 25 lines of JavaScript code! Our compiler will contain:
-
Lexical analysis module
-
Grammar analysis module
-
The parser will be based on EBNF syntax
-
We will develop the parser using a recursive descending parsing algorithm
-
Code generator
The language we will explore is not particularly useful for developing meaningful software programs, but it can easily be extended into one.
The entire implementation can be found in my GitHub introduction [3].
Introduce a simple prefix language
Here’s what an example expression in our language looks like:
mul 3 sub 2 sum 1 3 4
At the end of this article, we’ll be able to convert these expressions into JavaScript statements through the typical phase of any compiler.
For simplicity, we need to follow some rules:
-
We only have functions :mul, sub, sum, div.
-
Each string token is separated by a space.
-
We only support natural numbers.
To explore the semantics behind the above expressions, we define some JavaScript functions:
const mul = (… operands) => operands.reduce((a, c) => a * c, 1);
const sub = (… operands) => operands.reduce((a, c) => a – c);
const sum = (… operands) => operands.reduce((a, c) => a + c, 0);
Mul takes multiple operands and passes them through the => operator. The function simply multiplies them, such as mul(2, 3, 4)==24. Sub subtracts each of the passed arguments, and sum is the sum of the calculated arguments.
The above expression can be converted to the following JavaScript expression:
mul(3, sub(2, sum(1, 3, 4)))
or
3 times 2 minus 1 plus 3 plus 4.
Now that we understand the semantics, let’s start at the front end of the compiler!
Note: Similar prefix expressions can be evaluated simply with stack-based algorithms, but in this case we will focus on the concept rather than the implementation.
Develop the front end of the compiler
The front end of any compiler usually has a lexical analysis module [4] and a syntax analysis module [5]. In this section, we’ll build two modules in a few lines of JavaScript code!
Develop a lexical analyzer
The lexical phase is responsible for dividing the program’s input string (or stream of characters) into small chunks called tokens. Tags typically contain information about their type (if they are numbers, operators, keywords, identifiers, and so on), the substrings of the program they represent, and their position within the program. This location is usually used to report user friendly errors in case of invalid syntax structures.
For example, the following JavaScript program:
if (foo) {
bar();
}
An example JavaScript vocabulary analyzer would produce output:
[
{
lexeme: ‘if’,
type: ‘keyword’,
position: {
row: 0,
col: 0
}
},
{
lexeme: ‘(‘,
type: ‘open_paran’,
position: {
row: 0,
col: 3
}
},
{
lexeme: ‘foo’,
type: ‘identifier’,
position: {
row: 0,
col: 4
}
},
.
]
We will keep our lexical analyzer as simple as possible. In fact, we can do it with a JavaScript statement:
const lex = str => str.split(‘ ‘).map(s => s.trim()).filter(s => s.length);
Here, we divide the string with a single space, then map the resulting substring to the repaired version and filter out the empty string.
Calling lexer on an expression yields an array of strings:
lex(‘mul 3 sub 2 sum 1 3 4’)
// [“mul”, “3”, “sub”, “2”, “sum”, “1”, “3”, “4”]
That’s exactly what we were aiming for!
Now let’s move on to parsing!
Develop a parser
A parser (often called a parser) is a module of a compiler that generates an abstract syntax tree from a list of tags (or stream) 6. In this process, the parser generates syntax errors for invalid programs.
Typically, parsers are implemented based on syntax [7]. Here is the syntax of our language:
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
num = digit+
op = sum | sub | mul | div
expr = num | op expr+
The syntax contains numbers that can be combined to form numbers (num). There are four operations; An expression can be a number, or an operation, followed by one or more expressions. We will use individual definitions in the syntax, such as num and op, as rules.
This is known as EBNF syntax [6]. Look at the grammar a little, try to understand it, then forget it completely! After explaining the parser, we’ll get back to the syntax and you’ll see how everything connects together!
As mentioned earlier, a parser is a tool that converts a list of tags into an AST.
For example, for our expression:
mul 3 sub 2 sum 1 3 4
The parser will generate the following AST based on the syntax above:
Let’s explore this algorithm!
const Op = Symbol(‘op’);
const Num = Symbol(‘num’);
const parse = tokens => {
let c = 0;
const peek = () => tokens[c];
const consume = () => tokens[c++];
const parseNum = () => ({ val: parseInt(consume()), type: Num });
const parseOp = () => {
const node = { val: consume(), type: Op, expr: [] };
while (peek()) node.expr.push(parseExpr());
return node;
};
const parseExpr = () => /\d/.test(peek()) ? parseNum() : parseOp();
return parseExpr();
};
The bad news is, there’s a lot going on. The good news is that this is the most complicated part of the compiler!
Let’s break down the code into parts and look at each step step by step.
The node type
const Op = Symbol(‘op’);
const Num = Symbol(‘num’);
First, we define the different node types to be used in the AST, and we just need numbers and operations. For example, the number node 42 would look like this:
{
type: Num,
val: 42
}
The sum operator, applied to 2, 3, and 4, will look like this:
{
type: Op,
val: ‘sum’,
expr: [{
type: Num,
va: 2
}, {
type: Num,
va: 3
}, {
type: Num,
va: 4
}]
}
How simple it is!
Parser
After declaring the node type, we define a function called parse that takes a parameter called a tag. Inside it, we define five more functions:
-
Peek – Returns the current value of the local variable C associated with the tag element.
-
Consum – Returns the tag element associated with the current value of the c local variable and increment C.
-
ParseNum – Takes the current token (that is, call peek()), parses it into a natural number, and returns a new numeric token.
-
ParseOp – we’ll look at it a little bit.
-
ParseExpr – Checks whether the current tag matches the regular expression /\d/ (that is, a number) and calls parseNum if successful, otherwise returns parseOp.
The parsing operation
ParseOp is probably the most complex function of the above parsers. This is because of loops and indirect recursion. Here it is defined again so that it can be explained line by line:
const parseOp = () => {
const node = { val: consume(), type: Op, expr: [] };
while (peek()) node.expr.push(parseExpr());
return node;
};
ParseExpr calls parseOp when peek() returns a non-numeric value. We know this is an operator, so we create a new operation node. Note that we will not perform any further validation, but, in a real programming language, we would like to do so so that a syntax error is reported if an unknown tag is encountered.
In any case, in the node declaration, we set the list of “subexpressions” to an empty list (that is, []), set the operation name to the value peek(), and set the node type to Op. We then loop through all the tags by pushing the currently parsed expression into the list of subexpressions for the given node. Finally, we return the node.
Remember while (peek()) node.expr.push(parseExpr()); Perform an indirect recursion. We get the following expression:
sum sum 2
This will be:
-
Tokens [0] parseExpr Tokens [0] parseOp
-
After parseOp creates an action node, and because of the call to consume(), the c value increases.
-
Next, parseOp will iterate over nodes. For tokens[C], c for 1, parseExpr will be called.
-
ParseExpr finds that the current node is not a number, so it calls parseOp.
-
ParseOp creates another action node and increments C by 1, and loops through all tags again.
-
ParseOp is going to call parseExpr, and c is not equal to 2 anymore.
-
Tokens [2] is “2”. ParseExpr will call parseNum, create a value node, and increment the c variable by 1.
-
ParseNum will return a number of nodes and be pushed into the last operation node of the expression array, which was generated by the last parseOp call.
-
The last call to parseOp will return the action node because peek() will return undefined. ParseNum add c to 3, tokens[3] === undefined.
-
The node returned by the last parseOp call will be returned to the outermost parseOp call and that call will also return the operation node.
-
Finally, parseExpr will return the root action node.
The resulting AST looks like the following:
{
type: “Op”,
val: “sum”,
expr: [{
type: “Op”,
val: “sum”,
expr: [{
type: “Num”,
val: 2
}]
}]
}
The last step is to traverse the tree and generate JavaScript!
Recursive descent resolution
Now, let’s relate the individual functions to the syntax defined above to see why the syntax makes sense. Let’s look at the rules of EBNF syntax:
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
num = digit+
op = sum | sub | mul | div
expr = num | op expr+
Now they’re a little bit clearer, right? Expr looks a lot like parseExpr, where we can either parse out a number or an operation. Similarly, op expr+ looks like parseOp and num looks like parseNum. In fact, parsers often generate parsers directly from the syntax, as they are all directly related to recursive descent parsing algorithms [8].
In fact, we’ve just developed a simple recursive descent parser! Our parser is very simple (we only have four production rules in our syntax), but you can imagine how complex a parser for a real programming language would be.
It’s much easier to develop a language’s syntax by observing its simplified model than by writing a real parser. Parsers contain a lot of detail (for example, many of the syntax structures of your development language), in contrast to extremely simplified and parsimonious syntax.
Developing the compiler
In this part of the article, we’ll walk through the AST of the language and generate JavaScript. The entire compiler is just seven lines of JavaScript code (literally!).
const transpile = ast => {
const opMap = { sum: ‘+’, mul: ‘*’, sub: ‘-‘, div: ‘/’ };
const transpileNode = ast => ast.type === Num ? transpileNum(ast) : transpileOp(ast);
const transpileNum = ast => ast.val;
const transpileOp = ast => `(${ast.expr.map(transpileNode).join(‘ ‘ + opMap[ast.val] + ‘ ‘)})`;
return transpileNode(ast);
};
Let’s explore the implementation details line by line.
First, we defined a function called transpile. It takes an AST generated by the parser as a parameter. Then in opMap, we first mapped data operations to operations in the language. Basically, sum maps to +, MUl maps to *, and so on.
Next, we define a function transpileNode that takes an AST node. If the node is a number, we call transpileNum; otherwise, transpileOp is called.
Finally, we define two functions to handle the translation of a single node:
– transpileNum – Converts a number to a number in JavaScript (simply return the number directly).
– Converts operations to JavaScript arithmetic operations. Note the indirect recursion (transpileOp->transpileNode->transpileOp). For each operation node, we want to first transform its subexpressions. We do this by calling the transpileNode function.
On the last line of the body of the transpile function, we return the result of transpileNode to form the root node of the tree.
Bring it all together
Here’s how we connected all the pieces together:
const program = ‘mul 3 sub 2 sum 1 3 4’;
transpile(parse(lex(program)));
// (3 * (1 + 3 + 4)))
We call lex(program) to generate a list of tags, which we pass to the parse function to generate an abstract syntax tree (AST), and finally, we translate the AST into JavaScript!
conclusion
This article detailed the development of a very simple compiler (or transpile) that compiles prefix expressions into JavaScript. While this only explains the basics of compiler development, we introduce some very important concepts:
-
Lexical analysis
-
Syntax analysis
-
Source code generation
-
An EBNF grammar
-
Recursive descent resolution
If you are interested in further reading, I recommend:
-
Develop statically typed programming languages
-
Construct a simple interpreter
-
Compilers: Guidelines, Techniques, and Tools (2nd edition)
-
Type and programming language
reference
-
Static Code Analysis of Angular 2 and TypeScript Projects – https://mgv.io/ng-static-analysis.
-
Developing Statically Typed Programming Language – https://mgv.io/typed-lambda
-
Tiny Compiler – https://mgv.io/tiny-compiler
-
Lexical Analysis – https://mgv.io/wiki-lex
-
Syntax Analysis – https://mgv.io/wiki-parser
-
Abstract Syntax Tree – https://mgv.io/wiki-case-ast
-
EBNF grammar – https://mgv.io/wiki-ebnf
-
Recursive Descent Parsing – https://mgv.io/wiki-recursive-descent-parser