Author: Zhang Handong
Instructions
Illustrated Rust Compiler and Language Design
- Focus on diagrams. The purpose of these diagrams is to help developers understand the Rust compiler and language design at an overall structural and semantic level.
- Practice and summarize regularly; not necessarily every month, but keep at it.
- Hopefully many people will end up writing this series. I am just throwing out a brick to attract jade. This is a hard bone; let's gnaw it together.
Introduction
As you may have seen in Rust Daily, Microsoft, Amazon, Facebook and other big names are building their own Rust compiler teams and strategically targeting the Rust language. And the Rust Foundation is in the final stages of preparation, so it is likely that these giants have already joined it.
In my talk "The Five Years of Rust" at RustChinaConf 2020, I reviewed the development of Rust over the past five years. Although Rust has good momentum, most of its contributions actually come from communities abroad, while the Chinese community is in a state of learning and watching, waiting for the so-called killer app to come along and make Rust "go viral." Why can't our community contribute more substantively to Rust?
Therefore, as the new year dawned, I set a five-year Flag: 1,000 PRs to Rust within five years.
Then friends in the community did the math for me: 1,000 over five years means 200 per year, or about 0.5 per day. Some friends also pointed out that Rust's PR review cycle is very long; even if you could submit 200 PRs a year, the official team could not review that many.
This calculation makes a lot of sense; the goal really is difficult to achieve. In fact, I do not intend to complete this Flag by myself. Instead, I want to spark the community's interest in Rust and complete it together with my friends. If, over five years, I can get 1,000 people to participate, and each person submits only one PR, then this Flag of 1,000 PRs will be easily accomplished.
So, to fulfill this Flag, I have divided the next five years into three phases:
- Phase 1: 2021. The goal of this phase is to "get on the path".
- Phase 2: 2022 ~ 2023. The goal of this phase is to "advance".
- Phase 3: 2024 ~ 2025. The goal of this phase is to "reach the target".
In other words, this year is about getting on the path. To achieve this goal, I have made the following plan:
- Organize community efforts to translate the official Rust Compiler Development Guide.
- Organize a community Rust compiler team, start contributing to the Rust language, and record my learning and experience along the way in the Illustrated Rust Compiler and Language Design series.
Through these two documents, I hope to help and influence more people to contribute to the Rust language.
I know that compilers are one of the "three great romances" of programmers, and that the water runs deep. You may also say that compilers are written by people with PL backgrounds, and that ordinary people lack that kind of skill. Yes, compilers are difficult. But fortunately, difficult does not mean impossible: we can learn. Furthermore, you are not expected to implement a Rust compiler from scratch.
Contributing to the Rust language is not KPI-driven but interest-driven. Maybe you have read the Dragon Book, the Tiger Book, or the Whale Book on compiler principles, or maybe you have even implemented a language of your own. But none of that may be as rewarding as actually participating in a modern language project like Rust.
So this article series, Illustrated Rust Compiler and Language Design, will document not only my own experience learning the Rust compiler, but also yours, if you are willing to contribute. In this restless world, give yourself a piece of quiet ground and rediscover the original joy of technology.
Diagramming the Rust compilation process
When learning, I usually start from the whole and the periphery to grasp the overall picture and structure of a subject, and then go into details step by step. Otherwise, it is easy to get lost in the details.
So the first thing to understand is the overall Rust compilation process, shown in the diagram below:
The middle part of the figure shows the overall compilation process of Rust code, while the expansion of procedural macros and declarative macros is shown on the left and right sides, respectively.
The Rust language is implemented on top of the LLVM back end. At the compiler level, the Rust compiler is simply a compiler front end: it compiles text source code step by step down to LLVM IR, which is then handed to LLVM, the compiler back end, to finish compilation and generate machine code.
The overall Rust compilation process
- Rust source text first goes through the "lexical analysis" phase, which recognizes elements of the text as "tokens" that are meaningful to the Rust compiler.
- After lexical analysis, the token stream is converted into an abstract syntax tree (AST) through parsing (syntactic analysis).
- Once the AST is obtained, the Rust compiler performs "semantic analysis" on it.
Generally speaking, semantic analysis checks whether the source program conforms to the definition of the language. In Rust, the semantic analysis phase proceeds through two levels of intermediate representation.
- The HIR stage of semantic analysis.
HIR (High-Level Intermediate Representation) is a compiler-friendly representation of the abstract syntax tree (AST); much of Rust's syntactic sugar has already been desugared at this stage. For example, for loops are converted to loop, if let is converted to match, and so on. Compared with the AST, HIR is more convenient for the compiler to analyze, and it is mainly used for "type checking and type inference".
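The desugaring described above can be sketched by hand. The following is only an approximation of what the compiler does at the HIR level (the real lowering uses internal lang items), but it shows the idea: a for loop is sugar for into_iter plus a loop whose body matches on Iterator::next.

```rust
// Sugared form: an ordinary `for` loop.
fn sugared() -> i32 {
    let mut sum = 0;
    for x in 1..=3 {
        sum += x;
    }
    sum
}

// Roughly what HIR desugars the loop into: an `IntoIterator` call
// plus a `loop` whose body `match`es on `Iterator::next`.
fn desugared() -> i32 {
    let mut sum = 0;
    let mut iter = (1..=3).into_iter();
    loop {
        match iter.next() {
            Some(x) => sum += x,
            None => break,
        }
    }
    sum
}

fn main() {
    // Both forms compute the same result: 1 + 2 + 3 = 6.
    assert_eq!(sugared(), desugared());
    println!("both sum to {}", sugared()); // prints "both sum to 6"
}
```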
- The MIR stage of semantic analysis.
MIR (Mid-level Intermediate Representation) is an intermediate representation of Rust code, built from HIR as a further simplification. MIR was introduced in RFC 1211.
MIR is mainly used for borrow checking. In the early days, before MIR existed, borrow checking was done at the HIR stage. Its main problem was that lifetime checking was too coarse-grained: it could only judge by lexical scope, so a lot of perfectly normal code failed to compile. Non-lexical lifetimes (NLL), introduced in the Rust 2018 edition, were designed to address this problem and make borrow checking more precise. NLL is a term that arose from the introduction of MIR and the move of borrow checking onto it; as Rust develops, the term itself will eventually fade away.
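To make the difference concrete, here is a small sketch of code in the style that the old, lexically scoped borrow checker rejected but that compiles fine under NLL. The function name push_after_peek is invented for illustration.

```rust
fn push_after_peek(mut v: Vec<i32>) -> Vec<i32> {
    let last = v.last().copied(); // immutable borrow of `v`
    println!("last = {:?}", last); // last use of the borrow: NLL ends it here

    // Under purely lexical lifetimes, the borrow above would extend to
    // the end of the enclosing scope, making this mutation an error.
    // With NLL, the borrow is already dead, so this compiles.
    v.push(4);
    v
}

fn main() {
    assert_eq!(push_after_peek(vec![1, 2, 3]), vec![1, 2, 3, 4]);
}
```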
The MIR layer does a lot of work, including code optimization, incremental compilation, checking for UB in unsafe code, generating LLVM IR, and more. There are three key features to know about MIR:
- It is based on the Control Flow Graph.
- It has no nested expressions.
- All types in MIR are fully explicit; there are no implicit conversions. MIR is also human-readable (you can dump it with rustc --emit=mir), so reading MIR is a good way to learn about the behavior of Rust code while learning the language.
- There is also an intermediate representation between HIR and MIR, THIR (Typed HIR), which is not shown in the figure. THIR is a further desugared simplification of HIR that makes MIR easier to construct; in the source-code hierarchy it is part of MIR.
- The LLVM IR generation phase. LLVM IR is LLVM's intermediate language. LLVM optimizes the LLVM IR and then generates machine code from it.
Why use LLVM as the back end? LLVM is used not only by Rust but also by many other languages, such as Swift. Advantages of LLVM:
- The LLVM back end supports many platforms, so we don't need to worry about CPU or operating system issues (except for the runtime).
- The LLVM back end has a high level of optimization. We only need to compile the code into LLVM IR, and the LLVM back end can optimize the code accordingly.
- LLVM IR itself is relatively close to assembly language, yet it also provides a lot of ABI-level customization.
The Rust core team also helps maintain LLVM and submits patches when it finds bugs. For all its advantages, LLVM also has drawbacks, such as slow compilation. So last year the Rust team introduced a new back end, Cranelift, to speed up compilation in Debug mode. The Rust compiler's internal component rustc_codegen_ssa generates a back-end-independent intermediate representation, which Cranelift then processes. As of January 2021, rustc_codegen_ssa provides an abstract interface for all back ends, allowing other code-generation back ends (such as Cranelift) to plug in, which means the Rust language may be able to target multiple compiler back ends in the future.
That is the overall Rust compilation process. But Rust also includes powerful metaprogramming: "macros." How is macro code expanded at compile time? Keep reading.
Rust macro expansion
Rust essentially has two types of macros: declarative macros and procedural macros. Many people may not understand the difference, but perhaps they will after reading this section.
Declarative macros
Go back and look at the right-hand part of the figure above. As we saw, Rust first lexes source text to produce a TokenStream. During this process, if macro code is encountered (either declarative or procedural), a specialized macro parser is used to parse the macro code, expand it into a TokenStream, and merge it into the TokenStream generated from the ordinary source text.
You may wonder why Rust's macros are handled at the token level, when macros in other languages operate directly on the AST.
This is because the Rust language is still iterating quickly and its internal AST changes so frequently that the AST API cannot be exposed directly for developers to use. Lexical analysis is relatively stable, so Rust macros are currently based on token streams.
Declarative macros, then, are based entirely on TokenStreams. Their expansion replaces matched tokens with the specified tokens according to matching rules (similar to regular expressions), thereby generating code. Because this is just token substitution (though still far more powerful than C macros), you cannot perform arbitrary computation in the process.
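A minimal sketch of such token substitution, using an invented macro name square!: the matcher to the left of => pattern-matches the incoming tokens, and the right-hand side is the replacement emitted at the call site. No computation happens at expansion time; only tokens are substituted.

```rust
// A declarative macro: pure token substitution, no evaluation.
macro_rules! square {
    ($x:expr) => {
        // Parentheses matter: without them, `square!(1 + 2)` would
        // expand to `1 + 2 * 1 + 2` instead of `(1 + 2) * (1 + 2)`.
        ($x) * ($x)
    };
}

fn main() {
    assert_eq!(square!(3), 9);      // expands to (3) * (3)
    assert_eq!(square!(1 + 2), 9);  // expands to (1 + 2) * (1 + 2)
    println!("ok");
}
```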
Procedural macros
Declarative macros are very convenient, but because they can only do substitution, they are quite limited. So Rust later introduced procedural macros, which allow you to perform arbitrary computation during macro expansion. But didn't we just say that Rust doesn't expose an AST API? Why, then, are procedural macros so powerful?
In fact, procedural macros are also based on the TokenStream API, but the third-party library author dtolnay designed an AST that lives outside the compiler; by operating through this AST layer, you can achieve whatever result you want.
There is no problem that cannot be solved by adding one more layer. If one is not enough, add two.
dtolnay is known in the community as a master of API design. He has created several libraries, such as Serde, one of the most widely used libraries in the Rust ecosystem.
Anyway, procedural macros work as shown on the left side of the figure above. They rely on three libraries, which I call the "procedural macro trio":
- proc_macro2. A wrapper around proc_macro, the token-stream API provided by the Rust compiler itself, which makes those tokens usable outside of macro expansion as well.
- syn. Implemented by dtolnay, this library parses the TokenStream exposed through proc_macro2 into an AST and provides a convenient interface for operating on that AST.
- quote. This library works together with syn to turn the AST back into a TokenStream, which is merged back into the TokenStream generated from ordinary source text.
The whole procedural-macro process is like the water cycle in nature: vapor rises from the sea (TokenStream), passes through heavy rain (syn), falls to the ground (quote), forms a trickle (proc_macro2::TokenStream), and eventually rejoins the sea (TokenStream).
Understanding how procedural macros expand will help you learn to write them.
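As a sketch of how the trio fits together, here is roughly what a derive-style procedural macro looks like. This is not a runnable single file: it must live in its own crate with proc-macro = true and with syn, quote, and proc-macro2 as dependencies, and the macro name Hello and its generated method are invented for illustration.

```rust
use proc_macro::TokenStream;

// Invented example: `#[derive(Hello)]` generates an inherent `hello()` method.
#[proc_macro_derive(Hello)]
pub fn hello_derive(input: TokenStream) -> TokenStream {
    // 1. syn: parse the incoming token stream into an AST.
    let ast = syn::parse_macro_input!(input as syn::DeriveInput);
    let name = &ast.ident;

    // 2. Arbitrary computation over the AST is allowed here -
    //    this is what declarative macros cannot do.

    // 3. quote: turn the generated code back into a token stream,
    //    which the compiler merges into the ordinary TokenStream.
    let expanded = quote::quote! {
        impl #name {
            pub fn hello() -> &'static str {
                concat!("Hello from ", stringify!(#name))
            }
        }
    };
    expanded.into()
}
```

The three steps mirror the water-cycle metaphor above: tokens in, syn to an AST, quote back to tokens.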
Summary
This article has covered the compilation process of Rust code and the mechanism by which Rust macro code expands. Learning these will help you understand Rust's design. Has this article piqued your interest in the Rust compiler? The compiler is a deep pit, so let's dig into it together.
Thanks for reading.