Last year, I discussed a series of topics around codification with senior colleagues at my company and took some notes. After that, I started experimenting with ways of using code to transform code. I had some ideas at the start, and after a year of slow practice I have gained some new insights.
Here’s what we want to do: convert any language A into any language B. With that, we can:
- Quickly rewrite any system.
- Model a domain independently of any programming language.
- Produce a more powerful DSL.
- Create a new language.
Introduction 0: The Unified Language Model
The unified language model (UCM) abstracts over different programming languages and describes all of them with the same set of data structures.
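To make "the same set of data structures" a little more concrete, here is a minimal sketch in Java. The class names (`UnifiedFunction`, `UnifiedClass`, `UnifiedModelDemo`) and their fields are assumptions made up for this illustration; they are not taken from Coca or Chapi, whose real models are richer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical language-neutral model. The names are illustrative only
// and are not taken from Coca or Chapi.
final class UnifiedFunction {
    final String name;
    final String returnType;
    final List<String> parameterTypes;

    UnifiedFunction(String name, String returnType, List<String> parameterTypes) {
        this.name = name;
        this.returnType = returnType;
        this.parameterTypes = parameterTypes;
    }
}

final class UnifiedClass {
    final String sourceLanguage; // e.g. "java" or "go"
    final String name;
    final List<UnifiedFunction> functions = new ArrayList<>();

    UnifiedClass(String sourceLanguage, String name) {
        this.sourceLanguage = sourceLanguage;
        this.name = name;
    }

    // A query that no longer cares which language the node came from.
    List<String> functionNames() {
        List<String> names = new ArrayList<>();
        for (UnifiedFunction f : functions) {
            names.add(f.name);
        }
        return names;
    }
}

class UnifiedModelDemo {
    // What a Java front end might produce for `class UserService { User find(long id) }`.
    static UnifiedClass fromJava() {
        UnifiedClass c = new UnifiedClass("java", "UserService");
        c.functions.add(new UnifiedFunction("find", "User", Arrays.asList("long")));
        return c;
    }

    // What a Go front end might produce for `func (s *UserService) Find(id int64) User`.
    static UnifiedClass fromGo() {
        UnifiedClass c = new UnifiedClass("go", "UserService");
        c.functions.add(new UnifiedFunction("Find", "User", Arrays.asList("int64")));
        return c;
    }

    public static void main(String[] args) {
        for (UnifiedClass c : Arrays.asList(fromJava(), fromGo())) {
            System.out.println(c.sourceLanguage + ": " + c.name + " " + c.functionNames());
        }
    }
}
```

The point of the sketch is only that a Java class and a Go struct with methods end up in the same shape, so any later analysis or generation step can be written once against that shape.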
After I implemented Coca using Golang + Antlr, I realized that this was a viable solution. However, because Coca’s architecture and use cases are limited, and because Antlr supports Java much better than Go, I did not continue to implement this solution in Coca.
This led to my second attempt: using Kotlin + Antlr to unify the model across languages, which became my other open source project, Chapi. As I kept going, however, I found that the difficulty and the workload were considerable:
- Write parsers for different languages. The community already offers plenty of mature wheels, the best known being the Antlr grammars. The officially maintained repository (grammars-v4) contains a large number of Antlr grammars, covering both mainstream and niche languages.
- Design a unified language model. That is, design a set of language constructs compatible with different languages. This is a process of continuous improvement: the model becomes more complete, and more complex, as more languages are added.
- Convert each language. That is, map the syntax of each language onto the model above.
In terms of difficulty, the technical challenge lies mainly in steps 1 and 2. Step 3, on the other hand, is tedious, heavy manual labor. We need to be familiar with each programming language and map its fields one by one in order to convert it.
So I tried to build a community around Chapi and mentored a group of people hands-on. Although I had established a uniform way of writing support for each language (TDD + Tasking), many people seemed to be intimidated by ASTs, so very few took part. As a result, support for other languages languished.
Related resources:
- A detailed design can be found in my article “How to Model Code?”
- Detailed implementation can be found at github.com/phodal/chap…
Introduction 1: Behind the syntax highlighting
At the same time, even with enough people, Antlr is not a perfect answer. While writing support for different languages, I still ran into a number of cases where the Antlr grammars did not support some syntax, such as JavaScript imports and some problems with Java lambdas. In other words, the Antlr maintainers only host this grammar repository; how well each grammar actually works is unknown.
So I went back to the old way of using regexes, without writing them myself, of course. In my article “IDE Support for Programming Languages”, I talked about implementing parsing based on regular expressions. I described two kinds of editors:
- Sublime Text: YAML-based regular expression matching (Sublime Syntax Files)
- TextMate and VS Code: JSON-based regular expression matching (Language Grammars)
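To give a feel for what regex-based matching means here, the following is a minimal sketch that assigns scope names to tokens using an ordered list of regular expressions. It is an assumption-laden toy: the three rules and the class name `RegexScopeTokenizer` are made up for this example, it uses `java.util.regex` rather than Oniguruma, and real TextMate grammars also support nested begin/end rules, captures, and includes, none of which appear here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexScopeTokenizer {
    // Scope name -> pattern, tried in insertion order.
    // These rules are invented for the sketch, not copied from any real grammar.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("keyword.control", Pattern.compile("(?:if|else|return)\\b"));
        RULES.put("constant.numeric", Pattern.compile("\\d+\\b"));
        RULES.put("variable.other", Pattern.compile("[A-Za-z_]\\w*"));
    }

    // Returns "scope:text" entries for one line; characters no rule matches are skipped.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(line);
                m.region(pos, line.length());
                if (m.lookingAt()) { // the rule must match exactly at the current position
                    tokens.add(rule.getKey() + ":" + m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                pos++; // whitespace or punctuation outside the toy rules
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if x return 42"));
    }
}
```

Because the first matching rule wins, rule order carries meaning: the keyword rule has to come before the generic identifier rule, otherwise `if` would be scoped as a variable.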
So we chose the VS Code grammars as the basis for our language parsing. Under this model:
- We have a mature and stable set of language parsing tools, and a large team to maintain them.
- Its community is very large and has gone through a lot of iterations.
So, a few years ago, my colleagues and I started github.com/phodal/scie…, a library based on TextMate syntax highlighting.
Introduction 2: Code generation and JavaPoet
After we roughed out Scie, I started thinking about the next step: how to convert language A into language B. I took some inspiration from JavaPoet, a Java API for generating .java source files. Here is a simple example of JavaPoet code:
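```java
import com.squareup.javapoet.TypeSpec;
import javax.lang.model.element.Modifier;

// `main` is a MethodSpec built beforehand with MethodSpec.methodBuilder("main").
TypeSpec helloWorld = TypeSpec.classBuilder("HelloWorld")
    .addModifiers(Modifier.PUBLIC, Modifier.FINAL)
    .addMethod(main)
    .build();
```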
That is, we can write an API that emits the source code of language B. To implement transformation between arbitrary languages, we need a DSL that describes the differences between each language and the unified model. Later, I realized that I needed another DSL to transform the unified model back into the different languages.
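As a rough sketch of the idea, the example below describes one function signature in a neutral model and emits it as both a Java and a Kotlin declaration. Everything in it is hypothetical: `FunctionModel`, `TinyPoet`, and the single `int` to `Int` type mapping are assumptions for illustration, not part of JavaPoet, Chapi, or Charj.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical unified description of a function signature.
final class FunctionModel {
    final String name;
    final String returnType;                          // "int" or "void" in this sketch
    final List<String[]> params = new ArrayList<>();  // each entry is {name, type}

    FunctionModel(String name, String returnType) {
        this.name = name;
        this.returnType = returnType;
    }

    // Builder-style helper, loosely in the spirit of JavaPoet's fluent API.
    FunctionModel param(String paramName, String type) {
        params.add(new String[] {paramName, type});
        return this;
    }
}

class TinyPoet {
    // Emit the signature as a Java interface method declaration.
    static String toJava(FunctionModel f) {
        List<String> ps = new ArrayList<>();
        for (String[] p : f.params) {
            ps.add(p[1] + " " + p[0]);
        }
        return f.returnType + " " + f.name + "(" + String.join(", ", ps) + ");";
    }

    // Emit the same signature as a Kotlin abstract function declaration.
    static String toKotlin(FunctionModel f) {
        List<String> ps = new ArrayList<>();
        for (String[] p : f.params) {
            ps.add(p[0] + ": " + kotlinType(p[1]));
        }
        String ret = f.returnType.equals("void") ? "" : ": " + kotlinType(f.returnType);
        return "fun " + f.name + "(" + String.join(", ", ps) + ")" + ret;
    }

    // Only the one type mapping this sketch needs.
    private static String kotlinType(String type) {
        return type.equals("int") ? "Int" : type;
    }

    public static void main(String[] args) {
        FunctionModel add = new FunctionModel("add", "int").param("a", "int").param("b", "int");
        System.out.println(toJava(add));   // int add(int a, int b);
        System.out.println(toKotlin(add)); // fun add(a: Int, b: Int): Int
    }
}
```

The hard-coded `kotlinType` mapping is exactly the kind of per-language difference that a DSL would have to describe declaratively instead.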
Introduction 3: Evolution of intermediate representation
The core data structure of a compiler is an intermediate form of the program being compiled. — Compiler Design
Theoretically, with the two approaches above, we can directly generate models for different domains. But for debugging purposes, creating an intermediate language to carry them would let us do something more interesting, such as unifying compiler optimizations (just kidding, of course).
For a project, I spent a short time studying the implementation of Proguard + D8 and Android R8. They do similar things: they take .class bytecode, compile and optimize it, and convert it into the dex format used on Android phones. Of course, the switch to AOT is a more interesting topic (though I’m not familiar with it). In any case, a series of intermediate states is involved: Java -> .class -> .dex -> odex -> .oat. That is, from Java code to JVM bytecode -> Dalvik bytecode -> optimized Dalvik bytecode -> ART machine code.
Seen this way, a programming language is itself an intermediate representation, because the machine ultimately runs machine code. Hence the classic saying: code is written for people to read.
Introduction 4: DSL of DSL
Some compilers have a single IR (intermediate representation); others have a series of IRs. The most common case we see is a language that uses LLVM as its back end and generates LLVM IR as its intermediate form. Similarly, for what we want to do, we can design a high-level intermediate representation, similar in spirit to LLVM IR, to carry the design of the language.
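For readers who have not worked with an IR before, here is a minimal sketch of lowering an expression tree into three-address code, a generic textbook form of intermediate representation. It is not LLVM IR, ICAN, or Charj; the class `TinyIr` and its instruction format are assumptions chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

class TinyIr {
    // Expression tree: a leaf holds a name or literal, an inner node holds an operator.
    static final class Expr {
        final String value;
        final Expr left;
        final Expr right;

        Expr(String value) {
            this(value, null, null);
        }

        Expr(String op, Expr left, Expr right) {
            this.value = op;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    private final List<String> instructions = new ArrayList<>();
    private int nextTemp = 0;

    // Emits instructions for e and returns the name that holds its value.
    String lower(Expr e) {
        if (e.isLeaf()) {
            return e.value;
        }
        String l = lower(e.left);
        String r = lower(e.right);
        String temp = "t" + nextTemp++;
        instructions.add(temp + " = " + l + " " + e.value + " " + r);
        return temp;
    }

    List<String> instructions() {
        return instructions;
    }

    public static void main(String[] args) {
        // a + b * 2
        Expr e = new Expr("+", new Expr("a"), new Expr("*", new Expr("b"), new Expr("2")));
        TinyIr ir = new TinyIr();
        ir.lower(e);
        ir.instructions().forEach(System.out::println);
    }
}
```

Once code is in a flat form like this, the same optimization or emission pass can run regardless of which source language produced the tree, which is the property an intermediate language is meant to provide.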
Since the project involved a bit of code optimization, I also read the book Advanced Compiler Design and Implementation, which introduces ICAN as an intermediate language. Well, with that conclusion in hand, I no longer need to argue that an intermediate language is necessary. So the next step is:
Bootstrapping, in computer science, is the technique of producing a self-compiling compiler, that is, a compiler written in the source programming language that it is intended to compile.
In the industry, bootstrapping is usually defined in the context of compilers, but it can be applied in many more areas. For example, the Java build tool Gradle uses Gradle to build itself, which is relatively easy compared with bootstrapping a programming language.
And bootstrapping for humans means replacing oneself: let the tools do what tools can do, and let others do what they can do. So we need Charj to do what it can.
Charj Lang
Finally, back to business. With the above steps in place, we can:
- Use regular expressions to parse different languages and generate their syntax trees.
- Write the Poet API to convert those syntax trees into source code for a particular language.
- Design an intermediate language that serves as the carrier for converting language A into language B.
- Implement free conversion from language A to language B, and from language B back to language A.
This is the idea of translating from any language to any language. So my colleagues and I set out to design an intermediate language: Charj.
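One way to see why the intermediate language matters: with N languages, direct pairwise translation needs N × (N − 1) translators, while going through one intermediate form needs only N front ends and N back ends. The sketch below shows that hub-and-spoke shape. It is a deliberately trivial toy under stated assumptions: the "IR" is just the message of a single print statement, and `HubTranslator` with its string-matching front ends is invented for this illustration, not how Scie or Charj work.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class HubTranslator {
    // Per-language front end (source -> IR) and back end (IR -> source).
    private final Map<String, Function<String, String>> frontEnds = new HashMap<>();
    private final Map<String, Function<String, String>> backEnds = new HashMap<>();

    void register(String lang, Function<String, String> toIr, Function<String, String> fromIr) {
        frontEnds.put(lang, toIr);
        backEnds.put(lang, fromIr);
    }

    // Any registered language can reach any other by passing through the IR.
    String translate(String from, String to, String source) {
        Function<String, String> fe = frontEnds.get(from);
        Function<String, String> be = backEnds.get(to);
        if (fe == null || be == null) {
            throw new IllegalArgumentException("unsupported pair: " + from + " -> " + to);
        }
        return be.apply(fe.apply(source));
    }

    static int directTranslators(int n) {
        return n * (n - 1); // one translator per ordered pair of languages
    }

    static int viaIrComponents(int n) {
        return 2 * n; // one front end and one back end per language
    }

    // Toy setup: the "IR" is only the message of a single print statement.
    static HubTranslator demo() {
        HubTranslator hub = new HubTranslator();
        hub.register("python",
                s -> unwrap(s, "print(\"", "\")"),
                ir -> "print(\"" + ir + "\")");
        hub.register("java",
                s -> unwrap(s, "System.out.println(\"", "\");"),
                ir -> "System.out.println(\"" + ir + "\");");
        return hub;
    }

    private static String unwrap(String s, String prefix, String suffix) {
        if (s.length() < prefix.length() + suffix.length()
                || !s.startsWith(prefix) || !s.endsWith(suffix)) {
            throw new IllegalArgumentException("cannot parse: " + s);
        }
        return s.substring(prefix.length(), s.length() - suffix.length());
    }

    public static void main(String[] args) {
        System.out.println(demo().translate("python", "java", "print(\"hi\")"));
        System.out.println(directTranslators(5) + " direct vs " + viaIrComponents(5) + " via IR");
    }
}
```

Adding a third language to this toy means registering one more front end and back end, after which it can already be translated to and from the existing two.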
Of course, the main purpose of developing a language is to exercise our own abilities, whether in abstraction, algorithms, or anything else. Over a long life, that will pay off. In the future, please call me the author of the Charj language. PS: You can become a Charj language author too.
In retrospect, I’ve tried to build a variety of tools, from various editors to various command-line tools. After learning Rust, I explored the JVM and the internals of editors, and I’m trying to create tools for everyday use. In the past year, I have written the refactoring tool Coca, and then Chapi, which converts code into a unified language model. I already have a lot of experience with compiler front ends. Naturally, creating a language was the next step.
Why is it called Charj?
In terms of the original intent, the name Char would suit the language better, but Char (Cangjie) has already been registered as a trademark. The next best option was Charj, which can be read as a mixture of Chinese and English: a Char set (Ji). Or “character J”; I’m not sure yet what the J stands for. We can always redefine it and give it a new meaning.
Charj progress
Charj is written mainly in Rust. Rust’s own bootstrapping has shown that Rust is well suited to developing programming languages. Of course, the main reason is that I would rather write Haskell than C++.
Charj Lang (under design)
Charj Lang’s current work is divided into two parts:
- Refining the grammar design
- Designing the compilation pipeline
Although in theory Charj does not need to be compilable and runnable, we need both for bootstrapping. So we use LLVM for the back end and LALRPOP, an LR(1) parser generator for Rust, for the front end.
GitHub:github.com/charj-lang/…
Charj IDE (under development)
There is already a simple language plugin with basic highlighting and go-to navigation. If you have some experience in IDEA plugin development, you are welcome to join us.
GitHub:github.com/charj-lang/…
Scie
Scie (Simple Code Identify Engine) is a general-purpose language converter based on regular expressions. The main development work is almost complete, but a few issues remain:
- Performance optimization.
- Random errors occur when calling Oniguruma through FFI.
GitHub:github.com/charj-lang/…
Charj Poet (in development)
Charj Poet is a Rust API for generating Charj code. We plan to refine it once the grammar design is settled.
GitHub:github.com/charj-lang/…
Poet DSL (to be determined)
It has two parts:
- A new DSL that describes the conversion from different languages to Charj Lang.
- A new DSL that describes the conversion from Charj Lang to different languages.
Website
The humble and rough official website: Charj-lang.org/
Other
At this point, although I’ve read several books on compilers, I’m not an expert on compiler principles. So if you are also interested, you are welcome to join us.