Build a tiny compiler with Swift

For most developers, despite the fact that we deal with compilers on a daily basis, compilers are still a mysterious black box. In this a try! In Swift’s presentation, Samuel Giddins learned the basics of how a compiler works by building a new mini-compiler from scratch for his own programming language. He also describes how Swift provides elegant solutions to complex problems such as semantic parsing, lexical analysis, and code generation. Finally, we’ll implement and compile a brand new programming language!

If you’re interested in this topic, you can find the full code in the Segiddins /Sipquick repository on Github.

An overview of the(00:00)

I built a programming language called “Sipquick”, and THEN I built a compiler and test cases for it using Swift.

So what does this tiny compiler look like? Only 310 lines of code! I don’t count the Swift compiler’s lines of code, but I’m guessing it’s over 100,000. So, by comparison, my compiler certainly deserves the name “tiny.”

Why write a compiler?(00:53)

Compilers are something every developer deals with every day. So I’m curious about how the compiler works.What happens when I run Swift, C, or Clang?
This is a well-known problem area. There are now hundreds of different kinds of compilers. You can go to Google“How do I execute in the compiler…”You’ll get some better answers than you would on Stack Overflow. You might find tweets on Twitter, blog articles, and related books.
These suggestions work in any language. It is essentially a program that converts one text file into another.
I had never written a compiler before. But now I can put “Compiler Engineer” on my resume.

What does a compiler look like?(2:00)

The compiler is very simple, just a function:

Let compiler: (String) -> Data /* Executable */Copy the code

This function takes a string as an argument, which we call a “source file,” and outputs an executable file. It converts programs written in a language that we can read, understand, write, and analyze into another language that machines can read, understand, analyze, and execute.

So, how do we write a compiler? Here are some well-known steps:

Parsing source code
Lexical analysis
Do what’s called “semantic analysis.” For example, “Now here are some tokens, what do they mean? In Swift,ClassWhat does this word mean?”
To optimize the
Generating machine code

Sipquick(03:48)

The very nature of a compiler is that it must be highly functional. I built a programming language called “Sipquick”. This is a fictional book about a bird with a very loud, noisy call.

It’s a very crude language that I invented specifically for this presentation. If I tried to write a compiler for an existing language, I would have to implement a lot of things, which would make my job difficult and complicated.
This is an S-expressions based language (similar to Lisp). All content is enclosed in parentheses, so your compiler must be able to automatically match parentheses so that we can easily ensure that all content will compile successfully.

S-expression: The full name of s-expression is Symbolic Expression. It can be an atom or a list of parentheses. For example, (x, y), x, and y are all s-expressions.
This is a dynamic language, so there is no typing at compile time.
The compiler doesn’t know about types; it’s a functional language. By its very nature, this requires every function to take arguments and return something, which can cause some side effects in the language.

(def hello name (+ "Hello, " name)) (print (hello "world!" ))Copy the code

This is why we wrote “Hello, World!” in Sipquick. The way. Def is the keyword that defines the function, which is called Hello. It takes a parameter named name, and then it concatenates with a string. In the second line, we call this function and print it out.

The parser(05:41)

First we need to build a Parser. I am a try! Most of the code was outlined in a presentation at Swift Tokyo. It’s a parser Combinator — there’s some really interesting stuff in there, similar to Monad. But I didn’t like the fancy operators — I changed them to something that could be written and read.

This is the main parsing process:

let schemeParser: Parser<String, [Sexp]> = {return ignoreSecond(Sexp ().then (unichar("\n").maybemany ().fmap {$0.map {$0.0}} .then(dropEndingMetadata().or(empty().andThatsAll()))) }Copy the code

I’ve omitted all the configuration of the parser here, as well as some of the s-expression definitions. But the basic idea is that we need to parse in a set of S-expressions, and then as soon as we hit the end of the file or the newline flag, we’re done parsing the line of code. We end up with an array of S-expressions.

There are all kinds of generics here. Compilers hate generics, but we can convert strings and tokens we get into S-expressions.

Lexical analyzer(06:47)

_.fmap { Sexp.single($0) }
_.fmap { Sexp.many($0) }
_.fmap { _ in Sexp.none }
Copy the code

During parsing, we convert what we parse into an AST (Abstract Syntax Tree). I can parse out “These are empty parentheses and these are empty S-expressions” to make the analysis process easier. I don’t have to worry about what these strings mean in different places. This is done in the parser. Swift file.

Semantic analysis(07:45)

let sema: ([Sexp]) -> [Expression]
Copy the code

Now that we’ve finished parsing, we need to process the tokens. However, tokens have different meanings in many languages. How do you define a function? How do you define a variable? There are also keywords. Are these operators? How do different companies work together?

This is called Semantic Analysis: analyzing an abstract syntax tree and collapsing it into Expression so that a programming language can analyze it:

struct Expression { init(sexp: Sexp) { switch exp { ... }}}Copy the code

This initialization statement contains a bloated switch statement. I’ve been working on this thing for the last week. I won’t describe this code in detail here, because I don’t think this approach occurs in the semantic analysis of Swift itself. We select what we are looking for, and then convert these Lisp-like s-expressions into expressions that the compiler can understand to understand whether the Expression defines a variable, a function, or a function call.

In most compilers, we might check “Is this program correct? Is it constructed correctly?” . In a language like Swift, this also means that we need to check “is this type correct? Is this function called with the right number of arguments? Does this function exist?” . The language we’re writing now doesn’t need to do that. In addition, our compiler does not perform optimizations.

To optimize the(09:41)

Optimization, ironically, is one of the simplest things a compiler can do:

let optimizations: Array<[Expression] -> [Expression]>
Copy the code

Optimization is “Here are some corresponding expressions, and then we need to convert them to another expression”. In most well-seasoned compilers, you’ll probably see a lot of optimization. In this process, you will do the following:

Is there unreachable code? If so, delete it.
Are there any reusable pure functions? If so, we can evaluate the value and put the result directly into the place where it will be reused.

I haven’t done a single optimization here, but it’s one of the most important parts of the compiler’s execution, so people spend a lot of time doing it. If I spend 10 hours tuning to make the code run 1% faster, it’s worth it. Because a large number of users use compilers, a 1% optimization could mean thousands of hours of compilation saved for users. In addition, compiler optimization is also the focus of academic research, we need to study which conversion is both safe and fast.

Code generation(11:31)

Code Generation is the last step, at which point I understand the intent of the entire program. The compiler already knows what the entire program needs to do, but now it needs to output what the computer can run:

let codeGen: ([Expression]) -> Data
Copy the code

We take an array of expressions as arguments and convert them to executable files.

What is an executable file? The job of generating the machine code is left to Clang and LLVM. If you’ve ever used Babel for JavaScript, it’s essentially a compilation of one form of JavaScript into another; It is still essentially a compilation process.

I find the code generation step very difficult. The simplest way to do this is to import LLVM and hand the relevant data to LLVM, which does a lot of optimization and generates great machine code. LLVM is great if you’re using something like C. LLVM is written specifically for C, C++, and Objective-C compilers.

The language I’m building here is very dynamic, without any type, and we don’t know whether the variable is a string or an integer, or whether the function actually exists. I tried to adapt it to LLVM, but it didn’t work.

On slide 27, this is the idealized “how modern compilers work.” All of the LLVM operations happen at the end of the compilation, but I can’t do that now. I had to do the hardest part myself.

I compiled my Lisp-like language into Ruby. Ruby is a dynamic language, and it has a very stable standard library so I don’t have to write things like, “How do I pass this parameter to the program? How do I input something to the screen?”

extension Expression { func asRuby() -> String { switch self.type { ... }}}Copy the code

I implement the code generation step basically as an extension of Expression. What we do is we say, “Here’s an expression, and we can turn it into Ruby to represent it.”

This operation is recursive. Because we have nested expressions, in the form of function calls, where we make a top-level function call, and then maybe the arguments themselves contain other function calls or other expressions. The asRuby() method will convert the expression and output Ruby code, so as long asRuby is available on your machine, we can run our program.

The gist of this code is that we map the expression through $0.asRuby(), then join the resulting contents together with a newline flag, and add a Shebang (#!) at the top. Symbol, followed by the chmod command, which generates the executable and is ready to run.

So what does this step look like?

; fibonacci.spq
(def fib x (condition (<= x 1) x (+ (fib (- x 1)) (fib (- x 2)))))
(print (fib 10))
Copy the code

Will be converted to:

#! /usr/bin/env ruby def fib(x) if (x.<=(1)) x else (fib((x.-(1))).+(fib((x.-(2))))) end end print(fib(10))Copy the code

Here we add a definition that defines what the NTH Fibonacci number is. Lisp is not the most expressive language because it has a lot of parentheses, but we can compile it into something that looks a lot like Ruby, which is relatively friendly. If you actually run the code, you’ll see that it works.

On slide 32, I listed the entire code generation process. I know you can’t see it, but of course I did it on purpose. This code is in a file, very compact, but also relatively simple to implement, and I have done a good conversion and optimization for Ruby.

It basically selects Expression, then picks out “This is a function definition and these are parameter names,” and then converts it to the corresponding Ruby Expression.

extension Expression { func parenthesize(_ s: String) -> String var isOperatorCall: Bool { get } func asRuby(depth: Int = 0) -> String { switch kind { case .bare: { } case .call: if isOperatorCall { } else { } case .functionDefinition: { } case .empty: { } case .variableDeclaration: { } case .conditional: guard children.count == 3 else { } let conditional = children[0] let positive = children[1] let negative = children[2] { }}}}Copy the code

But other than that, the previous semantic analysis, optimization operations are very simple, we just need to implement “output into an object that reads like the language we are operating in”.

If I were trying to compile the language to Swift or C, this would be a very difficult operation, because how do I tell if a number is an integer or a floating point number? How do you determine if this is a string or a character? How do I allocate stacks, how do I allocate stacks? As a result, when you try to think of the language as a traditional compiler, you’ll find a lot of mismatches. So, this step depends entirely on the language features you want to compile and run.

Use this compiler(they)

$ cat fibonacci.spq
(def fib x (condition (<= x 1) x (+ (fib (- x 1)) (fib (- x 2)))))
(puts (fib ([] ARGV 0)))
$ sipquick fibonacci.spq fibonacci
$ ./fibonacci 10
55
Copy the code

I like the.spq extension. You can write a file that will end with SPQ. You can choose what file it outputs, and then you can pass in parameters and run the program just like any other normal executable.

Test compilerPrevailed (Philistine)

I only wrote integration tests. These tests are mostly about compiling a file, running it, and verifying that the output meets my expectations. Only by testing do I know that the compiler works, so this requires a well-formed program to compile and run, not some other bad example to verify its stability. In addition, I didn’t write any unit tests.

Here is one of the test cases:

(def print_even x (condition (== (% x 2) 0) (print "even") (print "odd")))
(print_even 17)
(print_even 12)
(print_even -1)
(print_even 1)
(print_even (* 0 1))
/////
it allows branching
0oddevenoddoddeven
Copy the code

Basically, I add a Sipquick flag at the top, then five slashes, followed by the name of the test case, the expected exit code, the expected output, and the name of the tester.

This is the format of a standard test. It has a name, the script you are running, the desired output, and the desired exit code. Then you just need to run the code and make sure everything is equal. Then, find the path to the compiler, or one that has all the spec files, and when you do, instantiate the test cases, run them, and see if they pass.

Here is the last test case I wrote this morning:

(def fizzbuzz x (condition (<= x 0) (return "") (condition (== 0 (% x 15)) (+ (fizzbuzz (- x 1)) "fizzbuzz") (condition (== 0 (% x 3)) (+ (fizzbuzz (- x 1)) "fizz") (condition (== 0 (% x 5)) (+ (fizzbuzz (- x 1)) "buzz") (+ (fizzbuzz (- x 1)) (String x))))))) (def fetch_arg position default (condition (! = nil ([] ARGV position)) ([] ARGV position) (return default))) (print (fizzbuzz (fetch_arg 0 100))) ///// it computes fizzbuzz 012fizz4buzzfizz78fizzbuzz11fizz1314fizzbuzz1617fizz19buzzfizz2223fizzbuzz26fizz2 829fizzbuzz3132fizz34buzzfizz3738fizzbuzz41fizz4344fizzbuzz4647fizz49buzzfizz525 3fizzbuzz56fizz5859fizzbuzz6162fizz64buzzfizz6768fizzbuzz71fizz7374fizzbuzz7677f izz79buzzfizz8283fizzbuzz86fizz8889fizzbuzz9192fizz94buzzfizz9798fizzbuzzCopy the code

I tried to implement FizzBuzz, and the test forced me to fix some compiler bugs to make the use case work. I had to figure out why the use case didn’t work for a while, and it turned out that the parentheses didn’t match because my compiler could only handle a single statement on a single line, and I finally got the use case through. I managed to implement FizzBuzz in my own programming language!

Subsequent improvements(PM)

The Swift code I wrote is bad, and I need to use some of itstructIs rewritten asenum;
Optimization is not implemented. Ruby code isn’t the fastest code in the world, and I’m pretty sure that compiling my code to run in Ruby would cause it to run dramatically slower;
You need to be able to define new variables in a function field. Functions and things like that are not ideal right now;
No error message. If not generated correctlyExpression of s –The program simply reports a fatal error. This makes it extremely difficult to figure out what is causing a compile failure;
There are no semantic analysis errors. You can define a function with the wrong number of arguments, you can call a function with the wrong number of arguments, and you can define something that is actually a function as a variable, all of which can still be compiled into Ruby, but should actually be prohibited;
I’m going to explore how to compile code directly into machine code, which I think would be a lot of fun.

In addition:

There is no standard library in place. I’m borrowing Ruby’s standard library right now;
Comments don’t work (I know there are friends out there who want me to build a programming language that doesn’t allow comments);
There is no concept of Lambda expressions, which would be strange if Lisp didn’t have Lambda;
Expressions can contain only one simple expression, so the function body must now be pure. There is no way to print a value and then call another function.

Lessons learned and conclusions(then)

First up: don’t actually write a compiler for a conference presentation! It’s just too much work. Unless you want to add compiler engineer to your job title.

Testing the compiler is essential because I don’t have to write a disposable gadget that only works the first time I use it and then dies.

String parsing in Swift is a real grind. Unicode is good, but it also makes things like parsing source files incredibly difficult.

Error messages are also a hard nut to crack. This seems simple enough, but it’s really hard to report why and what kind of error this place went wrong when parsing, and how do I express it? “You left out a close parenthesis here?”

The LLVM API I want to use is designed specifically for type languages, from which we can infer that the variable is of type Int32, or that the function has a function signature.

Implementing your own programming language is fun and makes sense. On my Github, I implemented a programming language of my own. But in reality, the programming languages that actually work are far better than the ones I did last week, because they take a lot of time, experience, and skill to make. So later on, if I want to complain about what Swift can’t do, I think I’ll probably better understand the mood of the Swift developers. Because this stuff is so hard.

resources

segiddins/Sipquick – GitHub
try! Swift Tokyo, Yasuhiro Inami – Parser Combinator in Swift – SpeakerDeck