This article was first published on CSDN.NET and in the December 2017 issue of Programmer magazine.

Preface

It is commonly believed that front-end work amounts to building a solid foundation in HTML, CSS, and JS: deeply understanding semantic tags, knowing N different layout methods, and mastering the language's syntax, features, and built-in APIs; then learning a few mainstream front-end frameworks and using the community's established scaffolding to quickly assemble a project. By that account, being good at the front end looks easy. Dig a little deeper, though, and you'll find the front end is an endless field of frameworks, tools, libraries, and newly reinvented wheels. Technology and versions update rapidly, but the underlying aims stay the same: tools are dedicated to automating and standardizing workflows, serving simple, elegant, and efficient coding, and abstracting problems into clean layers. With front-end open source as active as it is now, the users and maintainers of a framework are closer than ever: you can dig into the source code to understand a framework thoroughly, and you can also ask questions, join discussions, contribute code, and solve technical problems together, pushing the front-end ecosystem to develop and grow. Compilation principles, a foundational theoretical discipline, have become one of the theoretical cornerstones not only of JS language compilers themselves but also of open source front-end projects such as Babel, ESLint, Stylus, Flow, Pug, YAML, Vue, React, and Marked. Understanding how compilation works gives you a better understanding of the frameworks you work with.

What is a compiler?

To the outside world, a compiler is a black box that translates one source language into a semantically equivalent target language. For a modern high-level compiler, the source language is a high-level programming language that is easy to read and write, while the target language is machine language, that is, binary code a computer can execute directly. From the perspective of a language processing system, the overall workflow from source program to executable program is shown in Figure 1:

Figure 1: Overall workflow from source program to executable program

A compiler is divided into a front end and a back end. The front end covers lexical analysis, syntax analysis, semantic analysis, and intermediate code generation, and is machine independent; representative tools are Flex and Bison. The back end covers intermediate code optimization and object code generation, is machine dependent, and is represented by LLVM. In Web front-end engineering, thanks to the cross-platform nature of the host environments, browsers and Node.js, we only need to pay attention to the front-end part of the compiler to realize its application value. To better understand how a compiler front end works, this article takes the widely used Babel as an example of how source code is compiled into target code.

Babel

As the compiler for the new generation of ES syntax, Babel occupies a very important position in the front-end toolchain. It follows the ECMA-262 language specification to parse the latest syntax, without waiting for browsers to ship support for new features. The syntax parser Babel uses internally is Babylon; its abstract syntax tree (AST) node type definitions follow those of Mozilla's JS engine SpiderMonkey, extended and enhanced, and it also supports parsing Flow, JSX, and TypeScript syntax. Babylon implements the first two phases of the compiler: lexical analysis and syntax analysis.
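
For a quick feel of this, Babylon can be used directly as a standalone parser; a minimal sketch, assuming the babylon npm package is installed:

const babylon = require('babylon');

// Parse a source string into an AST; the top-level nodes are File and Program.
const ast = babylon.parse('var foo = "bar";');

console.log(ast.program.body[0].type); // 'VariableDeclaration'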

Lexical analysis

Lexical analysis is the first phase of processing the source program. Its main task is to scan the input characters one by one and convert them into a sequence of lexical units (Tokens), which is passed to the parser for syntax analysis. A Token is an indivisible minimum unit: for example, the three characters of var can only be treated as a whole and cannot be decomposed further without losing meaning, so var is a single Token. Each Token object has a type attribute and other additional attributes that identify it (operator precedence, line and column number, and so on). In the Babylon lexer, each keyword is a Token, each identifier is a Token, each operator is a Token, and each punctuation mark is a Token. Comments and whitespace characters (newlines, spaces, tabs, and so on) in the source program are filtered out.

The matching rules for Tokens can be described with regular expressions. For example, to match a Token of type Number, the lexer checks whether the current character matches [0-9], then loops or recurses over the contiguous characters that follow, taking special care with non-decimal values prefixed with 0b, 0o, or 0x, with scientific notation (e or E), and with decimal points. The pointer moves forward until the matching rule no longer holds or the end of the input is reached. Finally, a Token of type Number is generated with attributes such as its value and file location, added to the Token sequence, and the next round of scanning begins.
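
A minimal sketch of such a scan, handling only decimal integers and fractions for brevity (the prefix and exponent cases are left out):

// Scan one Number token starting at position `pos` in `input`.
// A real lexer would also handle 0b/0o/0x prefixes and e/E exponents.
function scanNumber(input, pos) {
  const start = pos;
  while (pos < input.length && /[0-9]/.test(input[pos])) pos++;
  if (input[pos] === '.') {
    pos++;
    while (pos < input.length && /[0-9]/.test(input[pos])) pos++;
  }
  return { type: 'Number', value: input.slice(start, pos), start, end: pos };
}

scanNumber('14px', 0); // { type: 'Number', value: '14', start: 0, end: 2 }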

A simple state-transition diagram for the Number type is shown in Figure 2:

Figure 2: State-transition diagram for the Number type

Besides the handwritten lexer approach Babylon takes, this process can also be implemented with finite automata (DFA/NFA): a lexical analyzer generator automatically converts pattern-matching rules into a lexer. That approach is not covered here.

Syntax analysis

Syntax analysis is the step after lexical analysis. Its main task is to scan the Token sequence produced by the lexer and construct an AST according to the grammar and the node type definitions, passing it on to the rest of the compiler front end. A grammar describes the construction rules of a programming language and guides the whole process of syntax analysis. It consists of four parts: a set of terminal symbols (also known as tokens), a set of non-terminal symbols, a set of productions, and a start symbol. For example, the productions for a function declaration statement are shown in Figure 3:

Figure 3: Productions for a function declaration statement
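
In a simplified form (loosely following the ECMA-262 grammar), such a production reads roughly:

FunctionDeclaration :
    function Identifier ( FormalParameters ) { FunctionBody }

Here function, the parentheses, and the braces are terminal symbols, while Identifier, FormalParameters, and FunctionBody are non-terminals that must be expanded further.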

Following the grammar, the parser reads in Tokens one by one and keeps replacing the grammar's non-terminal symbols until every non-terminal has been replaced by terminal symbols; this process is called derivation. There are two kinds of derivation: leftmost and rightmost. If the leftmost non-terminal in the production body is always replaced first, it is a leftmost derivation; if the rightmost non-terminal is always replaced first, it is a rightmost derivation.

Parsers fall into top-down and bottom-up analysis according to how they work. Top-down analysis constructs the AST from the top (the root) using leftmost derivations; common examples are recursive descent parsers and LL parsers. Bottom-up analysis constructs the AST from the bottom (the leaf nodes) using rightmost derivations in reverse; common examples are LR, SLR, and LALR parsers. Both kinds of analysis appear in Babylon.

The first is top-down analysis. Take a variable declaration statement:

var foo = "bar";

After processing by the lexer, the following Token sequence is generated:

Token('var') Token('foo') Token('=') Token('"bar"') Token(';')

Recursive descent analysis, performed here by an LL(1) parser, looks ahead one input Token at a time to decide which production to expand. For a variable declaration, the FIRST set (the set of first Tokens derivable from the production) means the parser only needs to check whether the input Token is one of Token('var'), Token('let'), or Token('const') before choosing this production to expand. It first constructs the top AST node, VariableDeclaration, and records the value of Token('var') in the node's attributes; it then reads the remaining Tokens one by one, constructing child nodes from left to right according to the non-terminals of the production, descending recursively until all Tokens have been read. The resulting AST is shown in Figure 4:

Figure 4: AST generated by top-down analysis
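
A minimal recursive-descent sketch of this step; expect, peek, next, parseIdentifier, and parseExpression are hypothetical helpers over the Token sequence above:

function parseVariableDeclaration() {
  // `expect` consumes a Token if it matches, `peek` looks ahead without
  // consuming, `next` consumes unconditionally -- all hypothetical helpers.
  const kindToken = expect(['var', 'let', 'const']); // the FIRST set check
  const node = { type: 'VariableDeclaration', kind: kindToken.value, declarations: [] };
  do {
    const id = parseIdentifier();   // e.g. foo
    let init = null;
    if (peek().value === '=') {
      next();                       // consume '='
      init = parseExpression();     // e.g. "bar"
    }
    node.declarations.push({ type: 'VariableDeclarator', id, init });
  } while (peek().value === ',' && next()); // handles `var a, b;`
  expect([';']);
  return node;
}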

The other is bottom-up analysis. Take a member expression statement:

foo.bar.baz.qux

This statement is equivalent to:

((foo.bar).baz).qux

Instead of:

foo.(bar.(baz.qux))

The reason is that the grammar for member expressions is left-recursive, and LL parsers cannot handle left-recursive grammars. In this case the AST can only be constructed bottom-up, using an LR parser. The core of an LR parser is shift-reduce analysis: it maintains a stack and, based on the next input Token, decides whether to shift the Token onto the stack or to reduce the symbols at the top of the stack (replacing a production body with the non-terminal that derives it), constructing child nodes first and their parents afterwards, until every symbol on the stack has been reduced. The resulting AST is shown in Figure 5:

Figure 5: AST generated by bottom-up analysis
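
A very rough sketch of the shift-reduce idea for this member expression; real LR parsers are table-driven, so this only illustrates the stack discipline:

// Build `((foo.bar).baz).qux` bottom-up with an explicit stack.
function parseMemberExpression(tokens) {
  const stack = [];
  for (const token of tokens) {
    if (token.value === '.') continue;      // '.' separates the parts
    const property = { type: 'Identifier', name: token.value };
    if (stack.length === 0) {
      stack.push(property);                 // shift the first identifier
    } else {
      const object = stack.pop();           // reduce: object '.' property
      stack.push({ type: 'MemberExpression', object, property });
    }
  }
  return stack[0];                          // ((foo.bar).baz).qux
}

parseMemberExpression([
  { value: 'foo' }, { value: '.' }, { value: 'bar' },
  { value: '.' }, { value: 'baz' }, { value: '.' }, { value: 'qux' }
]);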

In addition, a complete AST built by Babylon has the special top-level nodes File and Program, which describe basic information about the file, its module type, and so on.

Code generation

Industrial-grade language compilers usually also include a semantic analysis phase, which checks whether the program's context is consistent with the semantics defined by the language (type checking, scope checking, and so on), and a phase that generates intermediate code, such as three-address code, a linear description of the program in terms of addresses and instructions. But since Babel only translates between versions of ES syntax, this work can be left to the JS engine. Babel's most distinctive feature is its plugin mechanism: different Babel plugins are invoked for different target browser versions. Through an interface defined in the visitor pattern (a design pattern), the AST is traversed depth-first, and matched nodes are modified, removed, added, or moved, transforming the original AST into a new one.

The interface for a visitor pattern is defined as follows:

visitor: {
  Identifier: {
    enter(path) {
      // executed when traversal enters an Identifier node
    },
    exit(path) {
      // executed when traversal leaves an Identifier node
    }
  },
  // ... other node types
}
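
Built on this interface, a minimal sketch of a Babel plugin (the transformation here is purely illustrative):

// A toy plugin that renames every identifier `foo` to `bar`,
// following the standard Babel plugin shape.
module.exports = function () {
  return {
    visitor: {
      Identifier(path) {
        if (path.node.name === 'foo') {
          path.node.name = 'bar';
        }
      }
    }
  };
};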

The last stage is generating the target code: starting from the root of the AST, the generator recursively descends, calling a function for each node type that performs its semantic action and prints a code fragment, until the complete target code, the output of Babel's compilation, has been emitted.
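
A minimal sketch of such a recursive printer, covering only the node types from the earlier variable declaration example:

// Print source code back out of the AST, one node type at a time.
function generate(node) {
  switch (node.type) {
    case 'VariableDeclaration':
      return node.kind + ' ' + node.declarations.map(generate).join(', ') + ';';
    case 'VariableDeclarator':
      return generate(node.id) + (node.init ? ' = ' + generate(node.init) : '');
    case 'Identifier':
      return node.name;
    case 'StringLiteral':
      return JSON.stringify(node.value);
    default:
      throw new Error('Unknown node type: ' + node.type);
  }
}

// For the AST of `var foo = "bar";`, generate() prints the source back out.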

Template engines

Template engines were first born in server-side dynamic page development, with engines such as JSP, PHP, and ASP. Since the rapid rise of Node.js, the front end has produced a great many of its own, including EJS, Handlebars, Pug (formerly known as Jade), Mustache, and more, too many to count. Template engine technology makes rendering views with data more flexible, keeps data and presentation independent, and opens up more possibilities for logical abstraction. There are many ways to implement a template engine. A relatively simple one can be built directly on string substitution and concatenation, while a more complex engine such as Pug has a fairly complete lexical and syntax analysis pipeline that precompiles the template into JS code to be executed dynamically at runtime.
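
As a taste of the simpler string-substitution approach (this is not how Pug works internally), a minimal sketch:

// Replace #{name}-style placeholders with values from a data object.
function renderTemplate(template, data) {
  return template.replace(/#\{(\w+)\}/g, function (match, key) {
    return key in data ? String(data[key]) : match;
  });
}

renderTemplate('<h1>hello #{name}</h1>', { name: 'world' });
// => '<h1>hello world</h1>'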

For example, a template statement:

h1 hello #{name}

The AST generated by the Pug parser is shown in Figure 6:

Figure 6: AST generated by the Pug parser

The target code produced by the generator is (pseudocode):

'<h1>' + 'hello ' + name + '</h1>'

Call new Function at runtime to dynamically execute code:

var compiledFn = new Function('local', `
  with (local) {
    return '<h1>' + 'hello ' + name + '</h1>';
  }
`)

compiledFn({
  name: 'world'
})

The final HTML output is:

<h1>hello world</h1>

The whole process consists of two phases: precompilation and runtime. Of course, a good template engine also considers functionality, performance, and security: it avoids the with statement used above, and introduces caching, XSS defenses, and a more powerful, friendly, easy-to-use syntax.

It is also worth mentioning that the front-end MVVM frameworks represented by Angular, React, and Vue all introduce template compilation technology. Vue, as a progressive front-end solution, is favored by many developers. It offers both render functions and templates for view rendering. Using render functions requires calling the core API to build Virtual DOM nodes; the process is relatively complex and verbose, and once the DOM nesting gets deep, the code becomes hard to control and maintain. The other way to handle this complexity is to write HTML-based templates enhanced with Vue-specific syntax such as tags, directives, and interpolation, and let the compiler compile and optimize the template into a render function. This is more elegant, more convenient, and easier to code.
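
For example, the same view in both styles, a minimal sketch using the Vue 2 API:

// Equivalent to the template `<h1>hello {{ name }}</h1>`,
// hand-written as a render function instead:
new Vue({
  el: '#app',
  data: { name: 'world' },
  render: function (createElement) {
    return createElement('h1', 'hello ' + this.name);
  }
});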

CSS preprocessors

Front-end styling has evolved from the slash-and-burn era of raw CSS to the preprocessor languages represented by Sass, Less, and Stylus, which give CSS programmable features: variables, functions, expressions, and modularity, greatly improving developer productivity. These, too, are changes brought about by compilation technology. As before, the compiler performs lexical analysis on the original style code to produce a Token sequence. Parsing then generates the intermediate representation, an AST conforming to the node definitions. At the same time, a symbol table is built for each program block to record the names and attributes of variables, which assists variable scope analysis during the code generation phase. Finally, the compiler recursively descends the AST to generate CSS code that can be executed directly in the browser.

Take the preprocessor Stylus's syntax as an example:

foo = 14px

body
  font-size foo

The compiled AST is shown in Figure 7:

Figure 7: AST generated by the Stylus parser

The final generated object code is:

body {
  font-size: 14px;
}
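
Programmatically, the same compilation can be driven through the stylus package's JS API; a minimal sketch:

const stylus = require('stylus');

const source = 'foo = 14px\n\nbody\n  font-size foo';

// render() runs the full lex -> parse -> evaluate -> generate pipeline.
stylus.render(source, {}, function (err, css) {
  if (err) throw err;
  console.log(css); // the CSS shown above
});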

Behind this seemingly simple code conversion, the compiler does a great deal of syntax-level processing for us. Expressive power CSS never had before, together with continuously optimized compilation speed under the hood, makes writing CSS more concise and efficient, and stylesheets easier to maintain and manage.

Final thoughts

The purpose of this article is to show the reader that compilation principles are widely applied in front-end engineering and can help us solve engineering problems. Of course, actual coding demands plenty of patience and care: you must consider all kinds of syntax, analysis methods, and optimizations, and write good test cases. A good compiler has to be carefully polished and continuously optimized and upgraded to serve developers in every respect. If you have never studied compilation principles, I recommend finding a book and working through the subject systematically. Even if you never touch compiler internals in your daily work, the knowledge more or less helps you accumulate and master fundamentals, understand programming languages, learn and apply frameworks, and grow in your future career.