There is a saying on Zhihu that compilers, graphics and operating systems are the three romances of programmers.

Whether or not that is true, consider a programmer who writes code at a domestic Internet company and doesn't read any related books in their spare time. Three years out, how much of what they learned in school will be gone?

Obviously, the loss is substantial: in the day-to-day work of domestic Internet companies, programmers rarely touch any of these three areas. After a few years on the job, most programmers have only a vague gut reaction to compiler-theory concepts and have trouble connecting them into a coherent picture.

Compiler-theory concepts have a way of giving people a headache on sight. In school we memorize them by rote, and the moment the exam is over we can't wait to forget them all. I suspect many readers will wince at the following terms:

  • Nondeterministic finite automata/deterministic finite automata

  • Quadruple (four-tuple) sequences

  • Context-free grammars / BNF

  • Terminal/non-terminal

  • LL(1)/LR(1)

  • Ad hoc syntax-directed translation

If I had to follow the course, I couldn't memorize all these terms and definitions in 21 days, let alone 21 minutes. And even after memorizing them, how to actually write a compiler would still be another problem.

Most of the time we just want to write a compiler quickly, maybe out of curiosity, maybe to implement our own toy DSL (domain-specific language), maybe for self-defense in a fight.


I'm going to show you how to write a compiler in 21 minutes. The rambling above took about one minute, which leaves us 20.

Before we start on the compiler itself, here is a brief Q&A to set the stage:

  • What is a compiler?

Broadly speaking, a compiler is any program that converts code in one language into code in another language.

  • What does building a compiler actually involve?

A compiler is a whole pipeline of tools: lexical analysis and parsing at the front end, then intermediate representation (IR) generation, checking, analysis, and optimization, and finally code generation.

If you're a compiler practitioner, you spend most of your time in the middle; if you're an amateur, you spend most of your time on the front end and on code generation.

  • So what kind of compiler are we building today?

Since we only have 21 minutes, there's no way to learn how to write GCC; there isn't even enough time to learn Flex+Bison.

We will first pin down the syntax of the source language, then design the abstract syntax tree (AST) structure, write a parser that converts source text into an AST, and write a code generator that converts the AST into the target language.

In other words, compared with a normal compiler, we skip type checking and the analysis and optimization of intermediate representations, and we keep the language definitions deliberately dumb so that everything fits in 21 minutes.

Now, let's begin.


First determine the source language:

This is a language for the four arithmetic operations that looks like Lisp; its four binary operators are named add, sub, mul and div.

A compound arithmetic expression can be written like this:

(mul (sub 5 (add 1 2)) 4)

Then determine the target language:

It is also a language for the four arithmetic operations, but one that looks more readable; its four binary operators are the familiar infix +, -, * and /.

The source-language example above should compile into this:

((5 - (1 + 2)) * 4)

Finally determine the language in which we will write the compiler:

I chose Haskell, for two reasons. One is that writing parsers in Haskell is extremely convenient, thanks to the famous Parsec library. The other is that Haskell's algebraic data type definitions are, in effect, ASTs already.

Parsec is short for "parser combinators". A parser is a function that takes a string and produces a value of some type T. The Parsec library implements a large number of basic parsers and parser combinators; the combinators can combine the basic parsers and user-defined parsers into new, more powerful parsers.

For example, suppose you implement a parser that extracts an identifier name from the input text. Parsec provides a combinator called many; combining it with your parser gives you a new parser that extracts a whole list of identifiers from the input text.
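
A minimal illustration of that idea (this snippet is mine, not from the original code):

import Text.Parsec
import Text.Parsec.String (Parser)

-- one identifier: a letter followed by letters/digits, then skip trailing spaces
p_identifier :: Parser String
p_identifier = (:) <$> letter <*> many alphaNum <* spaces

-- `many` lifts it to a parser for a list of identifiers
p_identifiers :: Parser [String]
p_identifiers = many p_identifier

Running parse p_identifiers "" "foo bar baz" yields Right ["foo","bar","baz"].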

Why Parsec? Because defining parsers with Parsec gives you all the benefits of PEGs, and you don't have to learn anything outside the host language (such as the DSLs that Flex and Bison each bring along).

Of course, other languages have similar libraries: C++ has boost::spirit, and Java/C#/F#/JS all have ports of Parsec, Haskell's industrial-grade parser-combinator implementation. The only difference from Haskell is that in those languages you typically write a bit of extra logic to convert the parse results into an AST.

Don't worry if you haven't worked with Haskell before; the code below is declarative and largely self-describing.


The code begins by defining the structure of the AST so that any source language expression can be described using this structure.

A brief analysis of the source language gives a straightforward recursive definition of an expression: an expression is either a literal value, or a binary operator applied to two expressions.

The definition of a literal is equally simple: a literal is either an integer value or a floating-point value.

In Haskell these two definitions are written as follows:

type BinOp = String

data Val
    = IntVal Integer
    | FloatVal Float
    deriving (Show, Eq, Ord)

data Exp
    = ConstExp Val
    | BinOpExp BinOp Exp Exp
    deriving (Show, Eq, Ord)

These correspond to the text definitions above:

  • An expression Exp is either a literal expression ConstExp consisting of a Val, or a binary-operation expression BinOpExp consisting of an operator and two Exps.

  • A value Val is either an integer value IntVal consisting of an Integer, or a floating-point value FloatVal consisting of a Float.
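
For example, under these definitions the source expression (add 1 2) corresponds to the following value of type Exp:

BinOpExp "add" (ConstExp (IntVal 1)) (ConstExp (IntVal 2))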

Next, we start writing the parser. The process is to write a parser for each node type in the AST, then combine these parsers into one that can parse the entire AST.

Let's set ourselves a small goal first, say, implementing an integer parser p_int.

p_int :: Parser Integer
p_int = do s <- getInput
           case readSigned readDec s of     -- readSigned/readDec come from Numeric
             [(n, s')] -> n <$ setInput s'
             _         -> empty             -- empty comes from Control.Applicative
        <?> "p_int"

p_int_val :: Parser Val
p_int_val = IntVal <$> p_int
            <?> "p_int_val"

p_int defines a parser that extracts an Integer from the input text. p_int_val wraps p_int and defines a parser that produces an IntVal.
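
The original listing doesn't show the float parser; by analogy with p_int it presumably looks like this, with readFloat (also from Numeric) in place of readDec — this part is my reconstruction, not the posted code:

p_float :: Parser Float
p_float = do s <- getInput
             case readSigned readFloat s of
               [(n, s')] -> n <$ setInput s'
               _         -> empty
          <?> "p_float"

p_float_val :: Parser Val
p_float_val = FloatVal <$> p_float
              <?> "p_float_val"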

Then we combine the int and float parsers into a Val parser.

p_val :: Parser Val
p_val = listplus [p_int_val, p_float_val]

listplus can be understood simply as alternation over a list of parsers (a union), with backtracking handled inside its implementation.
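
Its definition isn't shown in the article; in terms of Parsec primitives it could plausibly be written like this (my sketch, assuming it just tries each parser in order and backtracks on failure):

-- try each parser in turn; `try` rewinds the input on failure so the next one gets a clean start
listplus :: [Parser a] -> Parser a
listplus = choice . map try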

Similarly, we implement the ConstExp parser and the BinOpExp parser, and then combine them into p_exp.

p_const_exp :: Parser Exp
p_const_exp = ConstExp <$> p_parse p_val
              <?> "p_const_exp"

p_bin_op_exp :: Parser Exp
p_bin_op_exp = p_between '(' ')' inner <?> "p_bin_op_exp"
    where
        inner = BinOpExp
                 <$> p_parse (listplus [string "add",
                                        string "sub", string "mul", string "div"])
                 <*> p_parse p_exp
                 <*> p_parse p_exp
                 <?> "p_bin_op_exp_inner"

p_exp :: Parser Exp
p_exp = listplus [p_const_exp, p_bin_op_exp]
        <?> "p_exp"
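
Two helpers appear above that the listing never defines: p_parse and p_between. My guess at their behaviour (an assumption, not the posted code) is that p_parse skips whitespace around a parser and p_between wraps a parser between two characters:

-- skip whitespace on both sides of a parser
p_parse :: Parser a -> Parser a
p_parse p = spaces *> p <* spaces

-- parse `inner` between an opening and a closing character
p_between :: Char -> Char -> Parser a -> Parser a
p_between open close inner = between (char open) (char close) inner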

So far, our parser part is complete.

If you're interested in Haskell, you can install GHCi (Haskell's REPL) and load parser.hs from the command line.

Prelude> parse p_exp "" "(mul (sub 5 (add 1 2)) 4)"

You can see the output. With a little formatting, it becomes the familiar tree structure, with the outermost BinOpExp (the mul operation) as the root of the tree. The entire output is an AST.

Right (BinOpExp
             "mul"
             (BinOpExp
                 "sub"
                 (ConstExp (IntVal 5))
                 (BinOpExp
                     "add"
                     (ConstExp (IntVal 1))
                     (ConstExp (IntVal 2))))
             (ConstExp (IntVal 4)))

With this AST, we are ready to do the rest of the code generation.

The body of the code generator is a function that converts an Exp into target-language code:

exp_gen :: Exp -> Maybe String
exp_gen (BinOpExp op e1 e2) = do
    s1  <- exp_gen e1
    sop <- op_gen op
    s2  <- exp_gen e2
    return (format "({0} {1} {2})" [s1, sop, s2])   -- format fills the "{0}"-style placeholders with the listed strings
exp_gen (ConstExp val) =
    case val of
        IntVal int_val     -> return (show int_val)
        FloatVal float_val -> return (show float_val)

Using the language's built-in pattern matching to dispatch on the different node types is simple and elegant.
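
exp_gen relies on an op_gen helper that the article doesn't show; presumably it just maps each source operator name to the corresponding target-language symbol. A sketch of what it might look like (my assumption):

op_gen :: BinOp -> Maybe String
op_gen "add" = Just "+"
op_gen "sub" = Just "-"
op_gen "mul" = Just "*"
op_gen "div" = Just "/"
op_gen _     = Nothing   -- an unknown operator makes code generation fail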

Finally, wrap all of this in a small driver that reads the source file and writes the target file, and the whole compiler is done.
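
The article doesn't show that driver; a minimal sketch of what it could look like (my assumption: the source file holds a single expression):

import System.Environment (getArgs)

main :: IO ()
main = do
    [srcPath, dstPath] <- getArgs
    src <- readFile srcPath
    case parse p_exp srcPath src of
        Left err  -> print err                     -- report parse errors
        Right ast -> case exp_gen ast of
            Just code -> writeFile dstPath code    -- emit target-language code
            Nothing   -> putStrLn "code generation failed"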


If you've read this far, you should have a pretty good idea of how to write a compiler quickly.

To write a compiler this way, you don't need to learn Flex and Bison or dig up your compiler-theory notes, but you do still need to know about PEGs and the parser-combinator library in a language you're comfortable with.

Even so, this is still the fastest route to a handwritten compiler. Don't many of us prefer to start from code anyway, skipping the painful abstractions and building things up from scratch?

As mentioned earlier, what we implemented is a compiler, but a dumb one. The handling of BinOp alone is problematic (a possible first fix is sketched after the list):

  • BinOp is represented as a string in the AST, so there is no way to check the types of the two operands.

  • BinOp becomes a special concept, rather than an ordinary function.
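
One natural first step toward fixing the first problem (my sketch, not part of the posted code) is to make BinOp a real sum type instead of a String:

data BinOp = Add | Sub | Mul | Div
    deriving (Show, Eq, Ord)

The parser then produces a constructor instead of a raw string, and op_gen can match on it exhaustively, so malformed operators are rejected at parse time rather than failing silently in code generation.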

If you're interested in fixing these problems, feel free to take my code, modify it directly, and extend it into your own compiler. The code is posted on GitHub; since I can't place a link in this feed, just send the message "compiler" to this account in the background and it will automatically reply with the GitHub link.

Space doesn't allow me to explain every piece of background knowledge in this article, so here is a brief reading list; if a topic interests you, search for the corresponding keywords:

  • For Haskell concepts, see Real World Haskell.

  • For parser-combinator concepts, see the papers "The Essence of Functional Programming", "Monads for Functional Programming" and "Monadic Parser Combinators".

  • I don't recommend the Dragon Book; if you want to go deeper, read Engineering a Compiler instead.

