4 - Parsing: the HTML parser

HTML's syntax is so forgiving that it cannot be described by a context-free grammar, so conventional parsers cannot parse HTML documents. The solution is for browser vendors to write their own custom HTML parsers. So, let’s take a look at what an HTML parser is.

Input (syntax)

HTML's vocabulary and syntax are defined in a specification created by the W3C. The syntax format is defined by a DTD (Document Type Definition), a format used to define languages of the SGML (Standard Generalized Markup Language) family; it lists the elements, attributes, and hierarchy allowed in the language. To stay backward compatible with older content, the DTD exists in more than one mode: the strict mode adheres fully to the HTML specification, while the others also support markup used by older browsers.

Parsing algorithm

Because HTML's grammar is forgiving and scripts can change the document while it is being parsed (e.g. document.write), conventional top-down or bottom-up parsers cannot be used. Browsers use a custom algorithm with two stages: tokenization and tree construction.

The first half of the parsing process is lexical analysis, also known as tokenization. The tokenizer is a state machine: at every step it holds a current state, and the next input character together with that state decides which state comes next and whether a token is emitted.
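The state-machine idea can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer of any browser: the three states and the `tokenize` function are names assumed for this example, and attributes, comments, and character references are deliberately ignored.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()      # reading text between tags
    TAG_OPEN = auto()  # just consumed "<"
    TAG_NAME = auto()  # reading the name of a start or end tag


def tokenize(html):
    """Return (kind, value) tokens for a very small subset of HTML."""
    tokens = []
    state = State.DATA
    buffer = []
    is_end_tag = False

    for ch in html:
        if state is State.DATA:
            if ch == "<":
                # Emit any pending text, then switch state.
                if buffer:
                    tokens.append(("text", "".join(buffer)))
                    buffer = []
                state = State.TAG_OPEN
            else:
                buffer.append(ch)
        elif state is State.TAG_OPEN:
            # "/" right after "<" means this is an end tag.
            is_end_tag = ch == "/"
            if not is_end_tag:
                buffer.append(ch)
            state = State.TAG_NAME
        elif state is State.TAG_NAME:
            if ch == ">":
                kind = "end" if is_end_tag else "start"
                tokens.append((kind, "".join(buffer)))
                buffer = []
                state = State.DATA
            else:
                buffer.append(ch)

    # Emit trailing text that was never followed by a tag.
    if state is State.DATA and buffer:
        tokens.append(("text", "".join(buffer)))
    return tokens


print(tokenize("<p>Hello World</p>"))
```

For the input `<p>Hello World</p>` the machine emits a start-tag token, a text token, and an end-tag token, in that order.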

The second half is building the DOM tree, also known as tree construction. It is the same process we described in “parse-theory Anatomy”: each token that matches a syntax rule is turned into a node and added to the DOM tree. This stage is also driven by a state machine that tracks the current phase (the HTML5 specification calls these phases insertion modes).

The output is the DOM tree: an object representation of the HTML document and the interface through which the outside world (for example, JavaScript) accesses HTML elements. The tree is made up of DOM element nodes and attribute nodes. Consider an example:

<html>
  <body>
    <p>
      Hello World
    </p>
    <div> <img src="example.png"/></div>
  </body>
</html>
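How this markup becomes a tree can be sketched with Python's standard `html.parser` module doing the tokenization and a stack of open elements doing the tree construction. The `Node`, `TreeBuilder`, and `outline` names are assumptions made for this illustration; a browser's DOM is far richer than this.

```python
from html.parser import HTMLParser

EXAMPLE_HTML = """<html>
  <body>
    <p>
      Hello World
    </p>
    <div> <img src="example.png"/></div>
  </body>
</html>
"""

VOID_ELEMENTS = {"img", "br", "hr", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs=None, text=""):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.open_elements = [self.root]  # stack; the last item is the current parent

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.open_elements[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <img .../>: add the node, never push it.
        self.open_elements[-1].children.append(Node(tag, attrs))

    def handle_endtag(self, tag):
        if len(self.open_elements) > 1 and self.open_elements[-1].tag == tag:
            self.open_elements.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.open_elements[-1].children.append(Node("#text", text=text))


def outline(node, depth=0):
    """Return the tree as a list of indented lines."""
    if node.tag == "#text":
        label = '#text "%s"' % node.text
    else:
        label = " ".join([node.tag] + ['%s="%s"' % kv for kv in node.attrs.items()])
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed(EXAMPLE_HTML)
builder.close()
print("\n".join(outline(builder.root)))
```

The printed outline shows `html` containing `body`, which in turn contains a `p` with one text node and a `div` with one `img` element carrying the `src` attribute.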

When parsing is finished, the browser runs the scripts that are in deferred mode (those that should be executed after the document is parsed), and then the load event is triggered.

Because the parser is written by each browser vendor and HTML's syntax is so forgiving, the parser needs an error-tolerance mechanism. For a long time this mechanism was not part of the HTML specification; it grew out of browser development, with vendors copying each other's behavior. HTML5 later specified some of these error-handling requirements (WebKit’s HTML parser has comments to this effect).
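The flavor of such error tolerance can be sketched as follows. This is an illustrative toy built on Python's `html.parser`, with the `TolerantBuilder` class and its two recovery rules assumed for the example; the recovery rules of real browsers and of the HTML5 specification are considerably more detailed and are not reproduced here.

```python
from html.parser import HTMLParser


class TolerantBuilder(HTMLParser):
    """Builds nested [tag, children] lists and repairs mismatched tags."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.open_elements = [self.root]
        self.recovered = []  # a note for each repair, kept for illustration

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.open_elements[-1][1].append(node)
        self.open_elements.append(node)

    def handle_endtag(self, tag):
        open_tags = [node[0] for node in self.open_elements[1:]]
        if tag not in open_tags:
            # Rule 1: an end tag with no matching open element is ignored.
            self.recovered.append("ignored stray </%s>" % tag)
            return
        # Rule 2: elements left open inside the one being closed
        # are closed implicitly.
        while self.open_elements[-1][0] != tag:
            unclosed = self.open_elements.pop()
            self.recovered.append("implicitly closed <%s>" % unclosed[0])
        self.open_elements.pop()

    def handle_data(self, data):
        if data.strip():
            self.open_elements[-1][1].append(data.strip())


tolerant = TolerantBuilder()
tolerant.feed("<div><p>one<b>two</div></span>three")
tolerant.close()
print(tolerant.root)
print(tolerant.recovered)
```

Instead of rejecting the broken input, the builder still produces a tree and records what it had to repair: two implicitly closed elements and one ignored end tag.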

The above describes how HTML documents are parsed, but in what order are scripts and styles processed?

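Because the model of the web is synchronous, when the parser reaches an internal <script> tag it halts document parsing until the script has been executed. An external script must first be fetched, and parsing stays halted until the resource has been fetched as well.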

Halting HTML parsing still delays page display, so unnecessary interruptions should be avoided by adding the defer attribute to the <script> tag: the script then does not halt document parsing and is executed after the document has been parsed.
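The resulting order of events can be illustrated with a toy simulation. Nothing is downloaded or executed here; the `ScriptOrderSimulator` class and the file names `a.js` and `b.js` are hypothetical, and the log only mimics the ordering described above (blocking scripts interrupt parsing, deferred scripts run after parsing, then the load event fires).

```python
from html.parser import HTMLParser


class ScriptOrderSimulator(HTMLParser):
    def __init__(self):
        super().__init__()
        self.log = []
        self.deferred = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            self.log.append("parse <%s>" % tag)
            return
        attrs = dict(attrs)
        name = attrs.get("src", "inline script")
        # defer only has an effect on external scripts (those with src).
        if "defer" in attrs and "src" in attrs:
            self.deferred.append(name)
        else:
            self.log.append("pause parsing, run " + name)

    def finish(self):
        self.close()
        for name in self.deferred:
            self.log.append("run deferred " + name)
        self.log.append("fire load event")
        return self.log


simulator = ScriptOrderSimulator()
simulator.feed(
    '<script defer src="b.js"></script>'
    "<p></p>"
    '<script src="a.js"></script>'
    "<div></div>"
)
events = simulator.finish()
print("\n".join(events))
```

Although `b.js` appears first in the document, it is logged after all parsing steps, whereas `a.js` interrupts parsing at the point where its tag occurs.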

While a script is executing, a “pre-parse” (speculative parse) is triggered: another thread continues to scan the rest of the document, finds resources that need to be loaded, and starts loading them in parallel, which improves overall speed. The pre-parser only looks for references to external resources (external scripts, style sheets, or images) and does not modify the DOM tree.
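A pre-parse of this kind can be sketched as a scanner that only collects references to external resources and never builds a tree. The `ResourceScanner` class and the file names are assumptions for this illustration; a real browser would hand the collected URLs to its network layer rather than store them in a list.

```python
from html.parser import HTMLParser


class ResourceScanner(HTMLParser):
    """Collects references to external resources; builds no DOM tree."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            self.resources.append((tag, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(("style", attrs["href"]))


scanner = ResourceScanner()
scanner.feed(
    '<link rel="stylesheet" href="main.css">'
    '<script src="app.js"></script>'
    "<p>text</p>"
    '<img src="example.png"/>'
)
scanner.close()
print(scanner.resources)
```

The paragraph is skipped entirely because it references nothing external; only the style sheet, the script, and the image end up in the list.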

On the style side, parsing style sheets does not change the DOM tree, so there is no need to interrupt document parsing. There is still a problem, though: if a script asks for style information while the style sheet has not yet been loaded and parsed, it gets wrong answers. The solution is to block scripts, but browsers differ in when they block. Firefox blocks all scripts while a style sheet is still being loaded or parsed. WebKit, on the other hand, blocks a script only when it tries to access style properties that may be affected by style sheets that have not loaded yet.

Now that the DOM tree has been built, let’s move on to the next stage.

Next article

Rendering trees – Theoretical anatomy