Preface
In “Build a blog with VuePress + GitHub Pages”, we built a blog with VuePress; the final result can be seen in the TypeScript Chinese documentation site.
While building the blog, we explained how to write a markdown-it plugin in “VuePress blog optimization: extending Markdown syntax” to meet a practical need. In this article, we dig into the markdown-it source code and explain how it works, with the aim of giving you a deeper understanding of markdown-it.
Introduction
To quote the introduction in the markdown-it GitHub repository:
Markdown parser done right. Fast and easy to extend.
As you can see, markdown-it is a Markdown parser that is easy to extend.
The demo is at: markdown-it.github.io/
markdown-it has the following advantages:
- Follows the CommonMark Spec and adds syntax extensions and syntactic sugar (e.g., URL autolinking and typographic replacements)
- Configurable syntax: you can add new rules or replace existing ones
- Fast
- Safe by default
- Many plugins and other packages are available in the community
Usage
// Install
npm install markdown-it --save
// node.js, "classic" way:
var MarkdownIt = require('markdown-it'),
    md = new MarkdownIt();
var result = md.render('# markdown-it rulezz!');
// browser without AMD, added to "window" on script load
// Note, there is no dash in "markdownit".
var md = window.markdownit();
var result = md.render('# markdown-it rulezz!');
Source code analysis
If we look at the entry code of markdown-it, we can see that its logic is clear:
// ...
var Renderer = require('./renderer');
var ParserCore = require('./parser_core');
var ParserBlock = require('./parser_block');
var ParserInline = require('./parser_inline');

function MarkdownIt(presetName, options) {
  // ...
  this.inline = new ParserInline();
  this.block = new ParserBlock();
  this.core = new ParserCore();
  this.renderer = new Renderer();
  // ...
}

MarkdownIt.prototype.parse = function (src, env) {
  // ...
  var state = new this.core.State(src, this, env);
  this.core.process(state);
  return state.tokens;
};

MarkdownIt.prototype.render = function (src, env) {
  env = env || {};
  return this.renderer.render(this.parse(src, env), this.options, env);
};
As the render method shows, rendering is split into two steps:
- Parse: parse the Markdown source into tokens
- Render: generate HTML from the tokens
This is much like Babel, except that Babel converts the source into an abstract syntax tree (AST). markdown-it chose not to use an AST, mainly to follow the KISS (Keep It Simple, Stupid) principle.
Tokens
What do Tokens look like? Let’s try it on the demo page:
The tokens generated for the input # header have the following format:
[
  {
    "type": "heading_open",
    "tag": "h1"
  },
  {
    "type": "inline",
    "tag": "",
    "children": [
      {
        "type": "text",
        "tag": "",
        "content": "header"
      }
    ]
  },
  {
    "type": "heading_close",
    "tag": "h1"
  }
]
For the meaning of each field, see the Token class documentation.
This simple example also shows how tokens differ from an AST:
- Tokens are just a flat array
- Opening and closing tags are separate tokens
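To make the contrast with an AST concrete, here is a minimal sketch that rebuilds a nested tree from a flat token array using a stack. The helper name tokensToTree and the shape of the tree nodes are assumptions made for this illustration and are not part of markdown-it; the sketch only relies on the _open / _close naming visible in the example above.

```javascript
// Hypothetical helper: nest a flat token array into a tree.
// Assumes paired tokens are named "<name>_open" / "<name>_close".
function tokensToTree(tokens) {
  const root = { type: 'root', children: [] };
  const stack = [root];

  for (const token of tokens) {
    const parent = stack[stack.length - 1];

    if (token.type.endsWith('_open')) {
      // An opening token starts a new nested node.
      const node = {
        type: token.type.slice(0, -'_open'.length),
        tag: token.tag,
        children: []
      };
      parent.children.push(node);
      stack.push(node);
    } else if (token.type.endsWith('_close')) {
      // A closing token ends the current node.
      stack.pop();
    } else {
      // Self-contained tokens (inline, text, ...) become leaves;
      // their own children, if any, are already nested.
      parent.children.push(token);
    }
  }

  return root;
}

const tokens = [
  { type: 'heading_open', tag: 'h1' },
  { type: 'inline', tag: '', children: [{ type: 'text', tag: '', content: 'header' }] },
  { type: 'heading_close', tag: 'h1' }
];

const tree = tokensToTree(tokens);
console.log(JSON.stringify(tree, null, 2));
```

The flat form needs no recursion to produce or to walk, which is the simplicity the token design trades for; the tree can always be recovered afterwards if a consumer needs it.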
Parse
Look at the code for the parse method:
// ...
var ParserCore = require('./parser_core');

function MarkdownIt(presetName, options) {
  // ...
  this.core = new ParserCore();
  // ...
}

MarkdownIt.prototype.parse = function (src, env) {
  // ...
  var state = new this.core.State(src, this, env);
  this.core.process(state);
  return state.tokens;
};
The code that actually runs lives in ./parser_core, so let's look at parser_core.js:
var _rules = [
  [ 'normalize',    require('./rules_core/normalize')    ],
  [ 'block',        require('./rules_core/block')        ],
  [ 'inline',       require('./rules_core/inline')       ],
  [ 'linkify',      require('./rules_core/linkify')      ],
  [ 'replacements', require('./rules_core/replacements') ],
  [ 'smartquotes',  require('./rules_core/smartquotes')  ]
];

function Core() {
  // ...
}

Core.prototype.process = function (state) {
  // ...
  for (i = 0, l = rules.length; i < l; i++) {
    rules[i](state);
  }
};
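The rule chain in this file can be sketched in a self-contained form. The rule bodies below are toy stand-ins written for this illustration, not the real markdown-it rules, and the name runRules is hypothetical. The point is only that each rule is a named function that receives one shared state object and mutates it in turn.

```javascript
// Toy rule chain: each entry is [name, function(state)].
const rules = [
  ['normalize', function (state) {
    // Unify line endings so later rules only deal with "\n".
    state.src = state.src.replace(/\r\n?|\n/g, '\n');
  }],
  ['block', function (state) {
    // Toy block rule: every non-empty line becomes one "inline" token.
    for (const line of state.src.split('\n')) {
      if (line.trim() !== '') {
        state.tokens.push({ type: 'inline', content: line, children: [] });
      }
    }
  }],
  ['inline', function (state) {
    // Toy inline rule: wrap the content of each token in a text child.
    for (const token of state.tokens) {
      token.children.push({ type: 'text', content: token.content });
    }
  }]
];

// Hypothetical driver: run every rule in order against the same state.
function runRules(state) {
  for (const [, rule] of rules) {
    rule(state);
  }
  return state.tokens;
}

const state = { src: 'first\r\nsecond\r', tokens: [] };
const result = runRules(state);
console.log(result.map((token) => token.content));
```

Because every rule works on the same state, the order matters: the toy inline rule has nothing to do until the block rule has produced tokens.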
As you can see, the parse process runs six rules by default. Let's go through them one by one.
1. normalize
In CSS, we use normalize.css to smooth out differences between browsers, and the same idea applies here:
// https://spec.commonmark.org/0.29/#line-ending
var NEWLINES_RE = /\r\n?|\n/g;
var NULL_RE = /\0/g;

module.exports = function normalize(state) {
  var str;

  // Normalize newlines
  str = state.src.replace(NEWLINES_RE, '\n');

  // Replace NULL characters
  str = str.replace(NULL_RE, '\uFFFD');

  state.src = str;
};
We know that \n matches a newline character and \r matches a carriage return character, so why replace \r\n with \n?
We can find the history of \r\n in Ruan Yifeng's article “Carriage Return and Line Feed”:
Before computers, there was a machine called the Teletype Model 33, which could print 10 characters per second. One problem was that moving to a new line took 0.2 seconds, exactly the time needed to print two characters. If a new character arrived during those 0.2 seconds, it would be lost.
So the developers came up with a solution: add two end-of-line characters after each line. One, called “carriage return”, tells the teletype to move the print head back to the left edge; the other, called “line feed”, tells the teletype to move the paper down one line.
This is where “line feed” and “carriage return” come from, as their English names suggest.
Later, computers were invented and these two concepts carried over. Memory was expensive back then, and some scientists thought that adding two characters at the end of every line was wasteful and that one would do. So opinions diverged.
On Unix, each line ends with only a line feed, i.e. “\n”; on Windows, each line ends with a carriage return followed by a line feed, i.e. “\r\n”; on the Mac, each line ends with a carriage return, i.e. “\r”. As a direct result, a Unix/Mac file opened on Windows shows all its text on a single line, while a Windows file opened on Unix/Mac may show an extra ^M symbol at the end of each line.
The reason \r\n is replaced with \n is to follow the specification:
A line ending is a newline (U+000A), a carriage return (U+000D) not followed by a newline, or a carriage return and a following newline.
U+000A indicates LF and U+000D indicates CR.
In addition to substituting carriage returns, the source code also replaces NULL characters. In re, \0 means matching NULL (U+0000) characters.
A Null character, also called an end character or NUL, is a control character with a value of 0.
Nulls are included in many character encodings, including ISO/IEC 646 (ASCII), C0 control code, universal character set, Unicode and EBCDIC, and almost all mainstream programming languages include nulls
The original meaning of this character is similar to that of the NOP command. When sent to a lister or terminal, the device does not need to do anything (although some devices will mistakenly print or display a blank).
Instead, we replace the empty character with \uFFFD. In Unicode, \uFFFD stands for the substitution character:
The reason for this substitution is to follow the specification, we refer to CommonMark Spec 2.3:
For security reasons, the Unicode character U+0000 must be replaced with the REPLACEMENT CHARACTER (U+FFFD).
Let’s test this effect:
md.render('foo\u0000bar'); // '<p>foo\uFFFDbar</p>\n'
In the rendered result, the otherwise invisible NULL character shows up as the replacement character.
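Both replacements can be tried without markdown-it. The following is a minimal sketch that reuses the two regular expressions from the normalize rule on a plain string; the function name normalizeSource is an assumption for this illustration, since the real rule works on state.src.

```javascript
// Same two patterns as in the normalize rule.
const NEWLINES_RE = /\r\n?|\n/g;
const NULL_RE = /\0/g;

// Hypothetical standalone version that takes and returns a string.
function normalizeSource(src) {
  return src.replace(NEWLINES_RE, '\n').replace(NULL_RE, '\uFFFD');
}

// Mixes Windows (\r\n), old Mac (\r) and Unix (\n) line endings plus a NULL.
const input = 'a\r\nb\rc\nd\u0000e';
const output = normalizeSource(input);

console.log(output === 'a\nb\nc\nd\uFFFDe'); // true
```

Note that \r\n is matched before a lone \r, so a Windows line ending becomes one \n rather than two.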
2. block
The block rule identifies which parts of the source are blocks and generates tokens for them. So what is a block, and what is an inline? We can find the answer in the “Blocks and inlines” section of the CommonMark Spec:
We can think of a document as a sequence of blocks: structural elements like paragraphs, block quotations, lists, headings, rules, and code blocks. Some blocks (like block quotes and list items) contain other blocks; others (like headings and paragraphs) contain inline content: text, links, emphasized text, images, code spans, and so on.
In other words:
A document is a sequence of blocks, which are structural elements such as paragraphs, quotes, lists, headings, and code blocks. Some blocks (like quotes and list items) can contain other blocks, while others (like headings and paragraphs) contain inline content such as text, links, emphasized text, images, and code spans.
As for which elements markdown-it identifies as blocks, see parser_block.js, which defines its own set of recognition and parsing rules: table, code, fence, blockquote, hr, list, reference, html_block, heading, lheading, and paragraph.
Let me pick out a few of the less obvious ones:
The code rule identifies indented code blocks, i.e. lines indented by 4 spaces.
The fence rule identifies fenced code blocks, i.e. code wrapped between lines of three or more backticks or tildes.
The hr rule identifies thematic breaks (horizontal rules), such as a line containing only ---, *** or ___.
The reference rule identifies link reference definitions, such as [foo]: /url "title".
The html_block rule identifies HTML block-level tags in Markdown, such as div.
The lheading rule identifies Setext headings, i.e. a line of text underlined with === or ---.
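How a chain of block rules cooperates can be sketched as follows. This is a simplified, hypothetical model rather than the markdown-it implementation: the three toy rules (hr, heading, paragraph), their regular expressions, and the function name tokenizeBlocks are all assumptions for illustration. Each rule inspects the current line and, when it matches, pushes tokens and reports how many lines it consumed; the first matching rule wins.

```javascript
// Toy block rules: [name, function(lines, index, tokens) -> lines consumed].
const blockRules = [
  ['hr', function (lines, i, tokens) {
    // Thematic break: three or more of the same marker (-, * or _).
    if (!/^ {0,3}([-*_])( *\1){2,} *$/.test(lines[i])) return 0;
    tokens.push({ type: 'hr', tag: 'hr' });
    return 1;
  }],
  ['heading', function (lines, i, tokens) {
    // ATX heading: one to six "#" followed by a space and the title.
    const match = /^(#{1,6}) +(.*)$/.exec(lines[i]);
    if (!match) return 0;
    const tag = 'h' + match[1].length;
    tokens.push({ type: 'heading_open', tag: tag });
    tokens.push({ type: 'inline', tag: '', content: match[2].trim() });
    tokens.push({ type: 'heading_close', tag: tag });
    return 1;
  }],
  ['paragraph', function (lines, i, tokens) {
    // Fallback: collect lines until the next blank line.
    let end = i;
    while (end < lines.length && lines[end].trim() !== '') end++;
    tokens.push({ type: 'paragraph_open', tag: 'p' });
    tokens.push({ type: 'inline', tag: '', content: lines.slice(i, end).join('\n') });
    tokens.push({ type: 'paragraph_close', tag: 'p' });
    return end - i;
  }]
];

function tokenizeBlocks(src) {
  const lines = src.split('\n');
  const tokens = [];
  let i = 0;

  while (i < lines.length) {
    // Blank lines only separate blocks.
    if (lines[i].trim() === '') { i++; continue; }

    // Try the rules in order; the first one that consumes lines wins.
    for (const [, rule] of blockRules) {
      const consumed = rule(lines, i, tokens);
      if (consumed > 0) { i += consumed; break; }
    }
  }

  return tokens;
}

const blockTokens = tokenizeBlocks('# Title\n\nsome text\nmore text\n\n---');
console.log(blockTokens.map((token) => token.type).join(' '));
```

The paragraph rule sits last because it accepts any non-blank line. Unlike a real parser, this toy paragraph cannot be interrupted by another block, so the example separates blocks with blank lines.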
3. inline
The inline rule parses the inline content of the block tokens and generates the corresponding tokens. It runs after block because blocks can contain inline content.
Let me pick out a few of the less obvious ones:
The newline rule identifies \n and replaces it with a softbreak token, or with a hardbreak token when the line ends with two or more spaces.
The backticks rule identifies inline code wrapped in backticks.
The entity rule handles HTML entities, such as &#123;, &#xAF;, and &quot;.
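As an illustration of what an inline rule produces, here is a self-contained sketch of line-break handling. In markdown-it, a line break inside a paragraph becomes a softbreak token, or a hardbreak token when the line ends with two or more spaces. The function name splitOnNewlines is hypothetical, and the code is a simplification rather than the markdown-it rule itself: it covers only the trailing-spaces case and emits plain objects instead of Token instances.

```javascript
// Hypothetical helper: split inline content into text and break tokens.
function splitOnNewlines(content) {
  const tokens = [];
  const lines = content.split('\n');

  lines.forEach((line, index) => {
    const isLast = index === lines.length - 1;
    // Two or more trailing spaces before "\n" request a hard break.
    const hard = !isLast && / {2,}$/.test(line);
    // Trailing spaces before a break are not part of the text.
    const text = isLast ? line : line.replace(/ +$/, '');

    if (text !== '') tokens.push({ type: 'text', content: text });
    if (!isLast) tokens.push({ type: hard ? 'hardbreak' : 'softbreak', tag: 'br' });
  });

  return tokens;
}

const inlineTokens = splitOnNewlines('one\ntwo  \nthree');
console.log(inlineTokens.map((token) => token.type).join(' '));
// text softbreak text hardbreak text
```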
4. linkify
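Automatically recognizes URL-like text and turns it into links. This rule only takes effect when the linkify option is enabled.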
5. replacements
Replaces (c) and (C) with ©, ???????? with ???, !!!!! with !!!, and so on. This rule only takes effect when the typographer option is enabled.
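This kind of replacement boils down to a handful of regular expressions. The sketch below covers only a small subset and is an assumption-laden illustration, not the markdown-it source: the function name applyReplacements and the exact patterns are chosen for this example.

```javascript
// Symbols for the parenthesized abbreviations handled in this sketch.
const ABBREVIATIONS = { c: '©', r: '®', tm: '™' };

// Hypothetical helper covering a few typographic replacements.
function applyReplacements(text) {
  return text
    // (c), (C), (r), (tm), ... regardless of case
    .replace(/\((c|tm|r)\)/gi, (match, name) => ABBREVIATIONS[name.toLowerCase()])
    // +- becomes the plus-minus sign
    .replace(/\+-/g, '±')
    // Four or more ? or ! in a row are cut down to three
    .replace(/([?!]){4,}/g, '$1$1$1');
}

console.log(applyReplacements('(c) (C) (tm) +-')); // © © ™ ±
console.log(applyReplacements('????????'));        // ???
console.log(applyReplacements('!!!!!'));           // !!!
```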
6. smartquotes
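For better typography, straight quotes are converted into curly quotes. Like replacements, this rule only takes effect when the typographer option is enabled.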
Render
The Render process is actually quite simple. If you look at renderer.js, you can see that there are some default rendering rules built in:
default_rules.code_inline
default_rules.code_block
default_rules.fence
default_rules.image
default_rules.hardbreak
default_rules.softbreak
default_rules.text
default_rules.html_block
default_rules.html_inline
The code_inline rule is as follows:
default_rules.code_inline = function (tokens, idx, options, env, slf) {
  var token = tokens[idx];

  return '<code' + slf.renderAttrs(token) + '>' +
         escapeHtml(tokens[idx].content) +
         '</code>';
};
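To show how such rules fit together, here is a minimal, self-contained renderer sketch. It is not the markdown-it Renderer: the names renderRules and renderTokens, the simplified escapeHtml, and the use of the _open / _close suffix to decide which tag to emit are all assumptions made for this illustration.

```javascript
// Minimal HTML escaping for text content.
function escapeHtml(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Rendering functions keyed by token type, in the spirit of default_rules.
const renderRules = {
  text: (tokens, idx) => escapeHtml(tokens[idx].content),
  code_inline: (tokens, idx) => '<code>' + escapeHtml(tokens[idx].content) + '</code>',
  softbreak: () => '\n'
};

function renderTokens(tokens) {
  let html = '';

  tokens.forEach((token, idx) => {
    if (token.type === 'inline') {
      // Inline containers are rendered through their children.
      html += renderTokens(token.children);
    } else if (renderRules[token.type]) {
      // A dedicated rule exists for this token type.
      html += renderRules[token.type](tokens, idx);
    } else if (token.type.endsWith('_open')) {
      html += '<' + token.tag + '>';
    } else if (token.type.endsWith('_close')) {
      html += '</' + token.tag + '>\n';
    }
  });

  return html;
}

const html = renderTokens([
  { type: 'heading_open', tag: 'h1' },
  { type: 'inline', tag: '', children: [
    { type: 'text', tag: '', content: 'a < b ' },
    { type: 'code_inline', tag: 'code', content: 'x && y' }
  ] },
  { type: 'heading_close', tag: 'h1' }
]);

console.log(html);
```

Keeping the rules in a plain object keyed by token type is what makes rendering easy to customize: replacing one entry changes the output for that token type only.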
Custom Rules
markdown-it provides ways to customize the rules of both the Parse and the Render process. This is also the key to writing markdown-it plugins, which we will cover later.
Series
The blog building series is the only practical tutorial series I have written so far. It explains how to build a blog with VuePress and deploy it to GitHub, Gitee, a personal server, and other platforms.
- Build a blog with VuePress + GitHub Pages
- How to keep code in sync between GitHub and Gitee
- Can't use GitHub Actions yet? Read this article
- How does Gitee automatically deploy Pages? GitHub Actions again!
- Enough Linux commands for front-end developers
- A simple but sufficient explanation of Nginx location configuration
- A detailed tutorial from buying a server to deploying blog code
- A detailed tutorial on domain names, from purchase to filing to DNS resolution
- VuePress blog optimization: setting the last updated time
- VuePress blog optimization: adding statistics
- VuePress blog optimization: enabling HTTPS
- VuePress blog optimization: enabling Gzip compression
- Implement a VuePress plugin from scratch
WeChat: “MQyqingfeng”. Add me to join Hu Yu's only reader group.
If you find any mistakes or imprecise statements, please do point them out; thank you very much. If you like the article or find it inspiring, a star is welcome as encouragement for the author.