This is the 10th day of my participation in Gwen Challenge.

Preface

Earlier, we looked briefly at the basics of an HTML parser. In this article, we’ll look at the implementation together.

Intercept the start tag

As mentioned earlier, each iteration of the loop cuts content off the beginning of the template, so a start tag only needs to be intercepted when the template begins with one.

So, how do you determine if a template starts with an opening tag?

In an HTML parser, the first check is simple: does the template start with <?

A template that starts with < may begin with a start tag, an end tag, a comment, and so on, because all of these fragments start with <. To confirm that it begins with a start tag, match the beginning of the template against a regular expression for start tags.
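
The check described above can be sketched as a small classifier. This is only an illustration, not the parser's actual code: the function name classifyFragment and its return labels are hypothetical, and the tag-name pattern is simplified to names without a namespace prefix.

```javascript
// Simplified tag-name pattern (assumption: no namespace prefix such as "svg:").
const tagName = '[a-zA-Z_][\\w\\-\\.]*';
const startTagOpen = new RegExp(`^<(${tagName})`);
const endTag = new RegExp(`^<\\/(${tagName})[^>]*>`);
const comment = /^<!--/;

// Hypothetical helper: report what kind of fragment the template starts with.
function classifyFragment(html) {
  if (!html.startsWith('<')) return 'text';
  if (comment.test(html)) return 'comment';
  if (endTag.test(html)) return 'endTag';
  if (startTagOpen.test(html)) return 'startTag';
  return 'text'; // a "<" that opens nothing is treated as text in this sketch
}

console.log(classifyFragment('<div></div>')); // startTag
console.log(classifyFragment('</div>'));      // endTag
console.log(classifyFragment('<!-- x -->'));  // comment
console.log(classifyFragment('hello <b>'));   // text
```

The order of the checks matters only for readability here, since the three patterns cannot match the same input.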

How do we use a regular expression to match a template that starts with a start tag?

const ncname = `[a-zA-Z_][\\w\\-\\.]*`;
const qnameCapture = `((?:${ncname}\\:)?${ncname})`;
const startTagOpen = new RegExp(`^<${qnameCapture}`);


'<div></div>'.match(startTagOpen) // ["<div", "div", index: 0, input: "<div></div>"]

The regular expression above tells us whether the template starts with a start tag, but it matches only a small part of that tag, not the whole of it.

A start tag has three parts: the tag name, the attributes, and the end.

Matching the tag name is enough to know that the template starts with a start tag and to extract the name; the attributes and the self-closing marker still need to be parsed separately.

Parsing tag attributes

Once the template is known to start with a start tag, the tag name is cut off the front, so by the time we parse the attributes the remaining template looks like this:

' class="box"></div>'

The following pseudocode shows how to parse an attribute in the start tag, but it can parse only one attribute:

const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/;
let html = ' class="box" id="el"></div>';
let attr = html.match(attribute);
html = html.substring(attr[0].length);
console.log(attr);
// [' class="box"', 'class', '=', 'box', undefined, undefined, index: 0, input: ' class="box" id="el"></div>']


Only the class attribute was parsed; the id attribute was not. To handle every attribute, parse them one at a time in a loop, cutting each one off the template after it has been parsed.

const startTagClose = /^\s*(\/?)>/;
const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/;
let html = ' class="box" id="el"></div>';
let end, attr;
const match = {
    tagName: 'div',
    attrs: []
};
// Keep going until the end of the start tag is reached or no attribute matches
while (!(end = html.match(startTagClose)) && (attr = html.match(attribute))) {
    html = html.substring(attr[0].length);
    match.attrs.push(attr);
}

As long as the remaining HTML template does not match the end of the start tag but does match an attribute, the loop body runs: it cuts the attribute off the template and records it.

Finally, the result of match is:

{
    tagName: "div",
    attrs: [
        [' class="box"', 'class', '=', 'box', undefined, undefined],
        [' id="el"', 'id', '=', 'el', undefined, undefined]
    ]
}

The remaining template looks like this:

"></div>"
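
The raw match arrays are awkward to work with, so a later step would typically turn them into name/value pairs. The following is a minimal sketch of that idea under stated assumptions: the helper name parseAttrs and the { name, value } shape are hypothetical and chosen only for illustration.

```javascript
const startTagClose = /^\s*(\/?)>/;
const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/;

// Hypothetical helper: consume every attribute and return name/value pairs
// together with the part of the template that is left over.
function parseAttrs(html) {
  const attrs = [];
  let attr;
  while (!startTagClose.test(html) && (attr = html.match(attribute))) {
    html = html.substring(attr[0].length);
    // Groups 3, 4 and 5 hold a double-quoted, single-quoted or unquoted value.
    const value = attr[3] ?? attr[4] ?? attr[5] ?? '';
    attrs.push({ name: attr[1], value });
  }
  return { attrs, rest: html };
}

const result = parseAttrs(` class="box" id='el' disabled></div>`);
console.log(result.attrs);
// [ { name: 'class', value: 'box' }, { name: 'id', value: 'el' }, { name: 'disabled', value: '' } ]
console.log(result.rest); // ></div>
```

An attribute without a value, such as disabled, matches only the name group, which is why the sketch falls back to an empty string.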

Parse the self-closing identifier

A self-closing tag has no child nodes. During parsing, the self-closing marker is used to decide whether a node needs to be pushed onto the stack.

How do we parse the end of the start tag?

function parseStartTagEnd(html){
    const startTagClose = /^\s*(\/?)>/;
    const end = html.match(startTagClose);
    const match = {};
    if (end) {
        match.unarySlash = end[1];
        html = html.substring(end[0].length);
        return match;
    }
}

parseStartTagEnd("/></div>") // { unarySlash: "/" }

For a self-closing tag, the parsed unarySlash property is "/"; for any other tag it is an empty string.
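
Both cases can be checked side by side. In the sketch below, the function name consumeStartTagEnd and the rest field are hypothetical; returning the remaining template is an assumption made here so that the caller can see what was cut off, since reassigning a function parameter does not change the caller's variable.

```javascript
const startTagClose = /^\s*(\/?)>/;

// Hypothetical variant: report the self-closing marker and the leftover template.
function consumeStartTagEnd(html) {
  const end = html.match(startTagClose);
  if (!end) return null; // the template is not at the end of a start tag
  return {
    unarySlash: end[1],
    rest: html.substring(end[0].length),
  };
}

console.log(consumeStartTagEnd('/></div>')); // { unarySlash: '/', rest: '</div>' }
console.log(consumeStartTagEnd('></div>'));  // { unarySlash: '', rest: '</div>' }
console.log(consumeStartTagEnd(' id="a">')); // null
```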

Source implementation

function advance(n) {
  index += n;
  html = html.substring(n);
}

function parseStartTag() {
  // Parse the tag name
  const start = html.match(startTagOpen);
  if (start) {
    const match = {
      tagName: start[1],
      attrs: [],
      start: index,
    };
    advance(start[0].length);
    let end, attr;
    // Parse the tag attributes
    while (
      !(end = html.match(startTagClose)) &&
      (attr = html.match(dynamicArgAttribute) || html.match(attribute))
    ) {
      attr.start = index;
      advance(attr[0].length);
      attr.end = index;
      match.attrs.push(attr);
    }
    // Check whether the tag is self-closing
    if (end) {
      match.unarySlash = end[1];
      advance(end[0].length);
      match.end = index;
      return match;
    }
  }
}
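
The functions above depend on html, index, and the regular expressions being defined in an enclosing scope. To try the logic in isolation, it can be wrapped in a closure. This is a simplified sketch, not the source itself: createStartTagParser and remaining are hypothetical names, and dynamic-argument attributes (dynamicArgAttribute) are deliberately left out.

```javascript
const ncname = '[a-zA-Z_][\\w\\-\\.]*';
const qnameCapture = `((?:${ncname}\\:)?${ncname})`;
const startTagOpen = new RegExp(`^<${qnameCapture}`);
const startTagClose = /^\s*(\/?)>/;
const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/;

// Hypothetical wrapper: keeps `html` and `index` as private state.
function createStartTagParser(template) {
  let html = template;
  let index = 0;

  function advance(n) {
    index += n;
    html = html.substring(n);
  }

  function parseStartTag() {
    const start = html.match(startTagOpen);
    if (!start) return undefined;
    const match = { tagName: start[1], attrs: [], start: index };
    advance(start[0].length);
    let end, attr;
    // Simplification: only plain attributes are handled in this sketch.
    while (!(end = html.match(startTagClose)) && (attr = html.match(attribute))) {
      attr.start = index;
      advance(attr[0].length);
      attr.end = index;
      match.attrs.push(attr);
    }
    if (!end) return undefined; // the start tag was never closed
    match.unarySlash = end[1];
    advance(end[0].length);
    match.end = index;
    return match;
  }

  return { parseStartTag, remaining: () => html };
}

const parser = createStartTagParser('<input type="text"/><p>hi</p>');
const tag = parser.parseStartTag();
console.log(tag.tagName, tag.unarySlash, tag.start, tag.end); // input / 0 20
console.log(parser.remaining()); // <p>hi</p>
```

The start and end values are character offsets into the original template, which is what makes it possible to report positions later on.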

Intercept the closing tag

Intercepting an end tag is much simpler than intercepting a start tag, because nothing needs to be parsed. We only determine whether the template currently starts with an end tag and, if so, trigger the hook function.

If the first character of the HTML template is not <, it is definitely not the closing tag. Only if the first character of the HTML template is < do we need to further verify that it is a closing tag.

const ncname = `[a-zA-Z_][\\w\\-\\.]*`;
const qnameCapture = `((?:${ncname}\\:)?${ncname})`;

const endTag = new RegExp(`^<\\/${qnameCapture}[^>]*>`);

'</div>'.match(endTag) // ["</div>", "div", index: 0, input: "</div>"]
"<div>".match(endTag) // null

Once an end tag has been matched, two things need to happen: cut it off the template, and trigger the hook function:

const endTagMatch = html.match(endTag);
if (endTagMatch) {
    html = html.substring(endTagMatch[0].length);
    options.end(endTagMatch[1])
    continue;
}
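
Because the snippet above uses continue, it only makes sense inside the parser's main loop. The same two steps can be sketched as a standalone function for experimentation. The name consumeEndTag and the convention of returning the remaining template (or null when nothing matched) are assumptions of this sketch; options.end stands in for the hook function.

```javascript
const ncname = '[a-zA-Z_][\\w\\-\\.]*';
const qnameCapture = `((?:${ncname}\\:)?${ncname})`;
const endTag = new RegExp(`^<\\/${qnameCapture}[^>]*>`);

// Hypothetical helper: if the template starts with an end tag, fire the
// `end` hook and return the rest of the template; otherwise return null.
function consumeEndTag(html, options) {
  const endTagMatch = html.match(endTag);
  if (!endTagMatch) return null;
  options.end(endTagMatch[1]);
  return html.substring(endTagMatch[0].length);
}

const closed = [];
const rest = consumeEndTag('</div><p>', { end: (name) => closed.push(name) });
console.log(closed); // [ 'div' ]
console.log(rest);   // <p>
```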

Intercept comments

As with the other cases, first check whether the first character of the remaining HTML template is <. If so, use a regular expression to match further:

const comment = /^<!--/;
if (comment.test(html)) {
    const commentEnd = html.indexOf('-->');
    if (commentEnd >= 0) {
        html = html.substring(commentEnd + 3);
    }
}

Intercept conditional comments

Intercepting conditional comments works on much the same principle as intercepting comments. Vue.js makes no use of conditional comments, so any that appear in a template are simply cut out.
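
By analogy with the comment case, this could look roughly like the sketch below. It assumes that a conditional comment begins with <![ and ends at the first ]>; the helper name skipConditionalComment is hypothetical.

```javascript
// Assumption: a conditional comment starts with "<![" and ends at the first "]>".
const conditionalComment = /^<!\[/;

// Hypothetical helper: drop a leading conditional comment, if there is one.
function skipConditionalComment(html) {
  if (conditionalComment.test(html)) {
    const conditionalEnd = html.indexOf(']>');
    if (conditionalEnd >= 0) {
      return html.substring(conditionalEnd + 2); // 2 is the length of "]>"
    }
  }
  return html; // nothing to cut
}

console.log(skipConditionalComment('<![if !IE]><p>hi</p>')); // <p>hi</p>
console.log(skipConditionalComment('<p>hi</p>'));            // <p>hi</p>
```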