Preface

This article is a simple exercise in compiler lexical analysis. Because a full lexer for a programming language is complex to implement, it uses the Markdown text markup language as the subject of a brief introduction.

Main text

1. What is Markdown?

Definition:

Markdown is a lightweight markup language created by John Gruber. It allows people to write documents in plain text format that is easy to read and write, and then convert them into valid XHTML (or HTML) documents.

Because Markdown is lightweight, easy to read and write, and (in many implementations) supports images, diagrams, mathematical expressions, and code blocks, many websites now use it for help documents and forum posts.

Standardization:

Markdown has long been defined only by an informal specification and a reference implementation for conversion to HTML, and over time many Markdown implementations have emerged. Most were developed to add features on top of the basic syntax, such as tables, footnotes, definition lists (technically HTML description lists), and Markdown inside HTML blocks. At the same time, ambiguities in the informal specification attracted attention. These issues prompted some developers of Markdown parsers to push for standardization.

RFC 7763 and RFC 7764 were published in March 2016. RFC 7763 introduced the MIME type text/markdown. RFC 7764 discussed and registered a number of Markdown variants, including the original Markdown, MultiMarkdown, GitHub Flavored Markdown (GFM), Pandoc, and CommonMark.

2. The relationship between Markdown and HTML

Here are some examples of how Markdown syntax corresponds to HTML syntax:

Markdown syntax                      HTML syntax
# Title 1                            <h1>Title 1</h1>
## Title 2                           <h2>Title 2</h2>
Blank line between paragraphs        <p></p>
Two spaces at the end of a line      <p><br /></p>
_italics_                            <em>italics</em>
**bold**                             <strong>bold</strong>
`code`                               <code>code</code>
---                                  <hr />
* 1                                  <ul><li>1</li></ul>

For reasons of space, only a few syntax rules are listed here. For the rest, see a complete Markdown syntax reference.
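To make the mapping concrete, three of the inline rows can be expressed as plain regular-expression substitutions. This is only a minimal sketch: the function name `renderInline` is hypothetical, and the patterns are deliberately simplified (no escaping, no nesting, and underscores inside code spans would be mishandled), so it does not follow any Markdown specification.

```javascript
// Hypothetical helper: converts a few inline Markdown constructs to HTML.
// The patterns are simplified for illustration and are not spec-compliant.
function renderInline(text) {
    return text
        // `code` becomes <code>code</code>
        .replace(/`([^`]+)`/g, '<code>$1</code>')
        // **bold** becomes <strong>bold</strong>
        .replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>')
        // _italics_ becomes <em>italics</em>
        .replace(/_([^_]+)_/g, '<em>$1</em>');
}

console.log(renderInline('_italics_, **bold** and `code`'));
// <em>italics</em>, <strong>bold</strong> and <code>code</code>
```

Chained substitutions like this break down quickly once constructs nest or span lines, which is one reason to split the work into a lexer and a renderer.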

Now that we know the syntax mapping between Markdown and HTML, how can we implement a compiler to recognize and convert the syntax?

3. Lexical analysis of Markdown

With Markdown's syntax rules in hand, we can analyze the source text. Here that means lexical analysis, and we use headings as the example.

First, we need a lexer that produces the corresponding sequence of tokens according to the heading syntax.

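// lexer
class Lexer {
    constructor(options) {
        this.tokens = [];
        this.tokens.links = Object.create(null);
        this.options = options || {};
        this.rules = {
            // Heading rule: up to 3 leading spaces, 1 to 6 "#" characters,
            // the heading text, and an optional closing run of "#"
            heading: /^ {0,3}(#{1,6}) +([^\n]*?)(?: +#+)? *(?:\n+|$)/
            // ...
        };
    }
    // Generate a token sequence
    token(src) {
        let cap; // result of the most recent rule match
        while (src) {
            // heading
            if ((cap = this.rules.heading.exec(src))) {
                // Consume the matched text and record a heading token
                src = src.substring(cap[0].length);
                this.tokens.push({
                    type: 'heading',
                    depth: cap[1].length,
                    text: cap[2]
                });
                continue;
            }
            // other

            // ...

            // Added guard: if no rule matched, stop instead of looping forever
            throw new Error('No rule matches input at: ' + src.slice(0, 20));
        }
        return this.tokens;
    }
}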
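To see what the heading rule captures, it can be tried on its own. The regular expression below is one reading of the `heading` rule in the lexer; the helper `matchHeading` and its `rest` field are hypothetical names introduced only for this illustration.

```javascript
// One reading of the heading rule: up to 3 leading spaces, 1 to 6 "#",
// at least one space, the heading text, an optional closing run of "#",
// then one or more newlines or the end of input.
const heading = /^ {0,3}(#{1,6}) +([^\n]*?)(?: +#+)? *(?:\n+|$)/;

// Hypothetical helper: returns a heading token plus the unconsumed input,
// or null when the input does not start with a heading.
function matchHeading(src) {
    const cap = heading.exec(src);
    if (!cap) return null;
    return {
        type: 'heading',
        depth: cap[1].length,              // number of "#" characters
        text: cap[2],                      // heading text without the closing "#"
        rest: src.substring(cap[0].length) // input left for the next rule
    };
}

console.log(matchHeading('## Title 2 ##\nnext line'));
// { type: 'heading', depth: 2, text: 'Title 2', rest: 'next line' }
console.log(matchHeading('plain text'));
// null
```

The first capture group gives the heading depth and the second gives the text, which are exactly the two fields the lexer stores in each heading token.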

4. Compiling (parsing) Markdown into HTML

The previous section produced Markdown's token sequence. Now we need to turn each token into the corresponding HTML tag, as follows:

// parsing and compiling
class Parser {
    constructor(options) {
        this.tokens = [];
        this.token = null;
        this.options = options || {};
        this.renderer = new Renderer(this.options); // pass the options on so headerIds / headerPrefix take effect
    }
    parse(src) {
        // Reverse a copy so that pop() yields tokens in their original order (first in, first out)
        this.tokens = src.slice().reverse();
        var out = '';
        while (this.next()) {
            out += this.tok();
        }
        return out;
    }
    next() {
        this.token = this.tokens.pop();
        return this.token;
    }
    tok() {
        switch (this.token.type) {
         // ...
        case 'heading': {
            return this.renderer.heading(
                this.token.text,
                this.token.depth
            )
        }
        // ...
        }
    }
}

// renderer (code generator)
class Renderer {
    constructor(options) {
        this.options = options || {};
    }
    heading(text, level, raw) {
        if (this.options.headerIds) {
            return '<h'
            + level
            + ' id="'
            + this.options.headerPrefix
            + '" >'
            + text
            + '</h'
            + level
            + '>\n';
        }
        return '<h' + level + '>' + text + '</h' + level + '>\n';
    };
}
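The two stages can be wired together in a few lines. The following is a compact, self-contained sketch rather than the classes above: the function names `lex`, `render`, and `compile` are hypothetical, and the `LINE` fallback rule (every non-heading line becomes its own paragraph) is an assumption added so that the loop always makes progress. Text is not HTML-escaped.

```javascript
const HEADING = /^ {0,3}(#{1,6}) +([^\n]*?)(?: +#+)? *(?:\n+|$)/;
// Assumed fallback rule (not from the article): one line of plain text.
const LINE = /^([^\n]*)(?:\n+|$)/;

// Stage 1: turn the source text into a flat token sequence.
function lex(src) {
    const tokens = [];
    src = src.replace(/^\n+/, ''); // drop leading blank lines
    while (src) {
        let cap = HEADING.exec(src);
        if (cap) {
            tokens.push({ type: 'heading', depth: cap[1].length, text: cap[2] });
        } else {
            cap = LINE.exec(src); // always matches non-empty input
            tokens.push({ type: 'paragraph', text: cap[1] });
        }
        src = src.substring(cap[0].length); // consume the matched text
    }
    return tokens;
}

// Stage 2: generate HTML directly from the tokens, with no tree in between.
function render(tokens) {
    let out = '';
    for (const token of tokens) {
        switch (token.type) {
        case 'heading':
            out += '<h' + token.depth + '>' + token.text + '</h' + token.depth + '>\n';
            break;
        case 'paragraph':
            out += '<p>' + token.text + '</p>\n';
            break;
        }
    }
    return out;
}

function compile(src) {
    return render(lex(src));
}

console.log(compile('# Title\nSome text\n## Title 2'));
// <h1>Title</h1>
// <p>Some text</p>
// <h2>Title 2</h2>
```

Iterating over the token array in order gives the same first-in, first-out behavior as reversing the array and calling pop().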

Afterword

Because Markdown is a text markup language, the approach shown here does not need to build or analyze an abstract syntax tree (AST) during parsing and transformation. The whole process amounts to lexical analysis followed directly by code generation, which makes it simpler than compiling a Turing-complete programming language.

Something to think about:

Can you build a simple parser that respects the precedence of addition, subtraction, multiplication, and division in JavaScript syntax?

For example, var a = 1 + 3 * 2;
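One possible starting point is precedence climbing, sketched below. It handles only the expression on the right-hand side (numbers, the four operators, and parentheses), not the `var a =` declaration. The `PRECEDENCE` table and the function names `tokenize`, `parse`, and `evaluate` are hypothetical choices for this sketch, not part of the article's code.

```javascript
// Binding power of each binary operator; a higher number binds more tightly.
const PRECEDENCE = { '+': 1, '-': 1, '*': 2, '/': 2 };

// Lexical analysis: split the source into number, operator and parenthesis tokens.
function tokenize(src) {
    const rule = /\s*(\d+(?:\.\d+)?|[-+*\/()])\s*/y; // sticky: matches only at lastIndex
    const tokens = [];
    let pos = 0;
    while (pos < src.length) {
        rule.lastIndex = pos;
        const cap = rule.exec(src);
        if (!cap) throw new SyntaxError('Unexpected character at position ' + pos);
        tokens.push(cap[1]);
        pos = rule.lastIndex;
    }
    return tokens;
}

// Syntax analysis: build an AST using precedence climbing.
function parse(tokens) {
    let pos = 0;

    function parsePrimary() {
        const token = tokens[pos++];
        if (token === '(') {
            const node = parseExpression(1);
            if (tokens[pos++] !== ')') throw new SyntaxError('Expected ")"');
            return node;
        }
        if (token === undefined || !/^\d/.test(token)) {
            throw new SyntaxError('Expected a number');
        }
        return { type: 'number', value: Number(token) };
    }

    // Only consume operators that bind at least as tightly as minPrec.
    function parseExpression(minPrec) {
        let left = parsePrimary();
        while (pos < tokens.length && PRECEDENCE[tokens[pos]] >= minPrec) {
            const op = tokens[pos++];
            // The "+ 1" makes operators of equal precedence left-associative.
            const right = parseExpression(PRECEDENCE[op] + 1);
            left = { type: 'binary', op, left, right };
        }
        return left;
    }

    const ast = parseExpression(1);
    if (pos < tokens.length) throw new SyntaxError('Unexpected token ' + tokens[pos]);
    return ast;
}

// Walk the AST and compute its value.
function evaluate(node) {
    if (node.type === 'number') return node.value;
    const l = evaluate(node.left);
    const r = evaluate(node.right);
    switch (node.op) {
    case '+': return l + r;
    case '-': return l - r;
    case '*': return l * r;
    default: return l / r;
    }
}

console.log(evaluate(parse(tokenize('1 + 3 * 2')))); // 7
```

Unlike the Markdown example, this sketch needs a tree between the tokens and the result, because the order in which operators are applied differs from the order in which they appear in the source.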


Related articles:

A Preliminary Study on Code Compilation (II)