I plan to open a new reading column, mainly writing some reading notes and understanding of the books I have read to share with everyone. This article is the first of its kind, the Do-it-yourself Lua: Virtual Machines, Compilers, and standard libraries:

Do-it-yourself Lua: Virtual machine, compiler, and standard library

Those of you who have not studied the principles of compiling systematically may wonder how the Lexer & Parser virtual machine is implemented. And suffer from systematic teaching material is too boring.

So in fact, this book as a system to learn the principle of preheating, I think it is very suitable. Even if you are not prepared to study it systematically, you will have a good understanding of how compilation works after reading this book.

In this book, author Zhang Xiuhong implemented a Lua virtual machine using Go. Its code is concise and concise, very easy to learn.

I’m not going to go into the details of this book, but I’m going to use it to show you how to implement a very simple programming language.

The tutorial is too long to be a good example, so I did it in 450 lines of Go code. Here it is:

Starting from the EBNF

Let’s implement a simple programming language called Pineapple (the code can be found at github.com/karminski/p… It’s only 450 lines of code, and the language isn’t even Turing-complete, but it’s enough for a demonstration.

The only function of this language is to assign a value to a variable and then print the variable, like this:

$a = "pen pineapple apple pen."
print($a)
Copy the code

It then prints:

pen pineapple apple pen.
Copy the code

Isn’t that easy? That’s right, and we’re going to do it.

Let’s start with EBNF. You may not understand the following, but you will soon! :

SourceCharacter::= #x0009 | #x000A | #x000D | [#x0020-#xFFFF] /* /[\u0009\u000A\u000D\u0020-\uFFFF]/ */ Name ::= [_A-Za-z][_0-9A-Za-z]*  StringCharacter ::= SourceCharacter -'"'
String          ::= '"' '"' Ignored | '"' StringCharacter '"' Ignored
Variable        ::= "$" Name Ignored
Assignment      ::= Variable Ignored "=" Ignored String Ignored
Print           ::= "print" "(" Ignored Variable Ignored ")" Ignored
Statement       ::= Print | Assignment
SourceCode      ::= Statement+ 
Copy the code

To put it simply, EBNF (Extended Backus — Naur Form) is a language that describes grammars, a blueprint for programming language grammars, with which we can easily implement lexical and syntax parsers for a language.

Let’s start with the first line, SourceCharacter:

SourceCharacter ::=  #x0009 | #x000A | #x000D | [#x0020-#xFFFF] /* /[\u0009\u000A\u000D\u0020-\uFFFF]/ */
Copy the code

This means that the SourceCharacter expression can be in the Unicode #x0009 or #x000A or #x000D or #x0020-#xFFFF range. This is basically the range of strings available. If you can’t read it, it doesn’t matter, we won’t implement it that way, just know about most of Unicode.

EBNF definitions can generate syntax diagrams that look like this:

The diagram is a left-to-right process, starting with the double arrows on the left, where each branch represents a possible process path. Finally, it ends at the opposite arrow on the right.

Then look at the Name expression:

Name ::= [_A-Za-z][_0-9A-Za-z]*
Copy the code

This means that the expression Name begins with an underscore _ or A letter a-za-z, which will be immediately recognizable to those familiar with regular expressions. It is then followed by underscores, upper and lower case letters, and digits (_0-9a-zA-z), with the last * indicating that these can have zero or more. If a type that is not a number starts with a number, then the parser implementation becomes quite complicated.)

From left to right, each branch represents a possible path. The syntax diagram nicely graphs our EBNF definition. Note that some of the branches connected to the middle horizontal line are closed inward and some are closed outward. Closed inward means that this can cycle, i.e. *, and closed outward means that it can only occur once.

Next comes the StringCharacter expression:

StringCharacter ::= SourceCharacter - '"'
Copy the code

Where – ‘”‘ means no double quotes “, that is, StringCharacter is SourceCharacter but does not contain double quotes.

The same String:

String: : ='"' '"' Ignored | '"' StringCharacter '"' Ignored
Copy the code

Represents a String expression by the empty String “” Ignored (inside double quotes is empty) or does not contain the String of double quotes” StringCharacter “Ignored. | representative or, Ignored represent negligible characters, such as space, TAB, Line breaks, comments, etc.

And then Variable expressions:

Variable: : ="$" Name Ignored
Copy the code

Can you see it already? Yes, Variable consists of a dollar sign, $, and a Name expression.

The Assignment expression is also simple:

Assignment  ::= Variable Ignored "=" Ignored String Ignored
Copy the code

Variable, followed by equal =, followed by String. Ignored indicates that the elements can be separated by Spaces, etc.

Print, Statement, SourceCode

Print: : ="print" "(" Ignored Variable Ignored ")" Ignored
Statement   ::= Print | Assignment
SourceCode  ::= Statement+ 
Copy the code

The Print statement consists of Print, open parenthesis (, Variable, right parenthesis).

Statement represents all statements in the syntax. Only Print and Assignment are valid statements.

Finally, the Statement is wrapped in SourceCode. + indicates that SourceCode can contain multiple statements.

At this point, we have completed the EBNF definition of the Pineapple language and learned the syntax of the language.

Come to an end

That’s the end of part 1. Part 2 will teach you how to build a Lexer. Stay tuned ~