baseparser

When parsing a piece of HTML text, you can first parse it into a simple object, and then enrich each parsed node object by further processing the attributes and text. Eventually an AST is formed.

Step one: what do you want

<div id="app">name</div>
  • Parse out the tag name, i.e. tagName = div
  • Parse out the attributes, i.e. attrList = [id="app"]
  • Parse out the text in between, which is the child element, i.e. children = ['name']

The sum of the above expectations yields the desired node object

const astNode = {
  tagName: "div",
  attrList: ['id="app"'],
  children: ["name"]
};

This is only a primitive AST node, but it is the structure we need to produce. To extract these three parts from an HTML string, we work on the input string repeatedly: parse a segment, cut off what has been parsed, and continue with the remainder.
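The parse-and-cut loop described above can be sketched as follows. This is only an illustration, not the parser built later in this article: the helper name consume and the simplified pattern tagOpen are assumptions made for the example.

```javascript
// Hypothetical helper: match `pattern` at the start of `input` and, on
// success, return the match together with the remaining, unparsed string.
function consume(input, pattern) {
  const match = input.match(pattern);
  if (!match) {
    return null;
  }
  return { match, rest: input.slice(match[0].length) };
}

// Simplified stand-in for the opening part of a start tag
// (assumption: lowercase ASCII tag names only).
const tagOpen = /^<([a-z]+)/;

const step = consume('<div id="app">name</div>', tagOpen);
console.log(step.match[1]); // div
console.log(step.rest);     // the part that is still left to parse
```

Each later stage would call the same kind of helper on step.rest, so the input keeps shrinking until nothing is left.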

Step two: gathering the supporting materials

For string work, traversing the string character by character is the obvious approach, and it does work. Instead, we use regular expressions to match whole chunks of the string at once, which makes it much faster to extract the AST structure contained in the HTML.

Since the parsing target is an HTML string, it helps to list the regexes we will use first.

// Opening part of a start tag
const startTag =
  /^<((?:[a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*\:)?[a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*)/;

// Closing part of a start tag
const startTagClose = /^\s*(\/?)>/;

// End tag
const endTag = /^<\/([a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*[^>]*)>/;

The regexes above are shamelessly lifted from the Vue source code. If you want to see what each of them matches, you can paste them into a regex visualization tool.

At this point, with the approach and the materials ready, it is time to write the parsing function that produces the structure from step one.
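To get a feel for what these patterns capture, a small experiment may help. The three regexes below are simplified, ASCII-only stand-ins written for this example (the names tagNamePattern, simpleStartTag, simpleStartTagClose and simpleEndTag are assumptions, and the namespace prefix and Unicode ranges are left out), so they are easier to read than the originals.

```javascript
// Simplified, ASCII-only stand-ins for the patterns above (assumption:
// tag names have no namespace prefix and no non-ASCII characters).
const tagNamePattern = '[a-zA-Z_][\\-\\.0-9_a-zA-Z]*';
const simpleStartTag = new RegExp('^<(' + tagNamePattern + ')');
const simpleStartTagClose = /^\s*(\/?)>/;
const simpleEndTag = new RegExp('^<\\/(' + tagNamePattern + '[^>]*)>');

const open = '<div>'.match(simpleStartTag);
console.log(open[0], open[1]); // <div div

const selfClose = ' />'.match(simpleStartTagClose);
console.log(selfClose[1]); // /

const end = '</div>'.match(simpleEndTag);
console.log(end[1]); // div

// Input that does not begin with < produces no match at all.
console.log('name</div>'.match(simpleStartTag)); // null
```

In every case index 0 holds the whole matched text, which is what gets cut off the input, and index 1 holds the captured group.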

Step three: getting started

Note that the code here is not written in one go; it is improved step by step until it produces the result we want. We start by parsing a single HTML string.

The first HTML string to parse is <div></div>:


function parse(input) {
    let root = null // Holds the parsed AST node
    let tagName = '' // Name of the tag currently being parsed
    // Keep looping while there is unparsed input left
    while (input) {
        const before = input // Remembered to detect an iteration that consumes nothing
        const textEnd = input.indexOf('<')
        if (textEnd === 0) { // < may begin a start tag, an end tag, or be a literal <
            // First try to match the start tag
            const match = input.match(startTag)
            if (match) {
                // This is a start tag
                input = input.slice(match[0].length)
                // Check whether the start tag is closed properly
                const closeStart = input.match(startTagClose)
                if (closeStart) {
                    // The start tag is closed properly
                    input = input.slice(closeStart[0].length)
                    root = {
                        tagName: match[1]
                    }
                    if (closeStart[1] === '/') {
                        // Self-closing tag: there is no end tag to look for
                        continue
                    }
                    tagName = root.tagName
                }
            }
            const matchEnd = input.match(endTag)
            if (matchEnd) {
                // An end tag is matched
                if (matchEnd[1] !== tagName) {
                    // The end tag does not pair with the start tag, so the markup is invalid and nothing is kept
                    root = null
                    break
                }
                input = input.slice(matchEnd[0].length)
            }
        }
        if (input === before) {
            // Nothing was consumed (text and attributes are not handled yet), so stop instead of looping forever
            break
        }
    }
    return root
}

console.log('parse', parse('<div></div>'));


The code above is a first pass that rests on a couple of assumptions:

  • When the string begins with <, what follows is a start tag, an end tag, or plain text.

    • The text case is ignored for now, so it has to be one of the first two
  • There are two ways a tag can be closed

    • Self-closing tag: <b />
    • Paired tags: <div></div>

With these two premises in place, the flow becomes clear: if the string starts with <, try a start tag match and then an end tag match.

Start tag processing
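The decision described in the assumptions above can be sketched as a small classifier. The function name classify and the two simplified ASCII-only patterns are assumptions for this example only; the parser in this article does not contain such a function.

```javascript
// Simplified ASCII-only patterns, assumed here so the sketch runs on its own.
const startTagSketch = /^<([a-zA-Z_][\-\.0-9_a-zA-Z]*)/;
const endTagSketch = /^<\/([a-zA-Z_][\-\.0-9_a-zA-Z]*[^>]*)>/;

// Hypothetical helper: report what the front of `input` looks like.
function classify(input) {
  if (input.indexOf('<') !== 0) {
    return 'text'; // does not begin with <
  }
  if (startTagSketch.test(input)) {
    return 'start-tag';
  }
  if (endTagSketch.test(input)) {
    return 'end-tag';
  }
  return 'text'; // a lone < that opens no tag
}

console.log(classify('<div></div>')); // start-tag
console.log(classify('</div>'));      // end-tag
console.log(classify('< 1'));         // text
```

The start tag pattern is tried first, which is safe because it requires a name character right after <, so it can never swallow an end tag.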

// Match the start tag
const match = input.match(startTag)
if (match) {
    // This is a start tag
    input = input.slice(match[0].length)
    // Check whether the start tag is closed properly
    const closeStart = input.match(startTagClose)
    if (closeStart) {
        // The start tag is closed properly
        input = input.slice(closeStart[0].length)
        root = {
            tagName: match[1]
        }
        if (closeStart[1] === '/') {
            // Self-closing tag: there is no end tag to look for
            continue
        }
        tagName = root.tagName
    }
}

Handling the start tag is not difficult. The key is knowing what match contains. Here is an example:

const a = '<div>'

const match = a.match(startTag)

/**
 * The main contents of match:
 * [
 *   '<div', // the matched part
 *   'div'   // the captured tag name
 * ]
 */

Next, make sure the start tag is properly closed, since only then is it a complete start tag. Continuing with what is left of the string:

const a = '>'

const closeStart = a.match(startTagClose)

/**
 * The main contents of closeStart:
 * [
 *   '>', // the matched part
 *   ''   // empty string for a normal tag; '/' if the tag is self-closing
 * ]
 */

At this point, a complete match of the start tag is completed.

For the string <div>:
  1. Match the <div part to get the tag name div.
  2. Match the remaining > to determine whether the start tag is complete.
     2.1 If / is matched, the tag is self-closing.
     2.2 If / is not matched, the tag is not self-closing, and the start tag ends here.

End tag processing
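The two steps above can be folded into one function. The sketch below is an assumption-laden illustration, not code from this article: parseStartTag, openPattern and closePattern are hypothetical names, and the patterns are simplified to ASCII tag names.

```javascript
// Simplified ASCII-only patterns, assumed so the sketch is self-contained.
const openPattern = /^<([a-zA-Z_][\-\.0-9_a-zA-Z]*)/;
const closePattern = /^\s*(\/?)>/;

// Hypothetical helper: parse one attribute-free start tag from the front of
// `input`. Returns null when the input does not begin with a complete start tag.
function parseStartTag(input) {
  // Step 1: match '<' plus the tag name.
  const open = input.match(openPattern);
  if (!open) {
    return null;
  }
  const rest = input.slice(open[0].length);
  // Step 2: match the closing '>' or '/>'.
  const close = rest.match(closePattern);
  if (!close) {
    return null;
  }
  return {
    tagName: open[1],
    selfClosing: close[1] === '/', // step 2.1 versus step 2.2
    rest: rest.slice(close[0].length),
  };
}

console.log(parseStartTag('<div></div>')); // tagName 'div', selfClosing false, rest '</div>'
console.log(parseStartTag('<b />'));       // tagName 'b', selfClosing true, rest ''
console.log(parseStartTag('<div id="app">')); // null, attributes are not handled here
```

Returning the remaining string alongside the result keeps the function free of side effects, at the cost of the caller having to reassign its own input variable.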

const matchEnd = input.match(endTag)
if (matchEnd) {
    // An end tag is matched
    if (matchEnd[1] !== tagName) {
        // The end tag does not pair with the start tag, so the markup is invalid and nothing is kept
        root = null
        break
    }
    input = input.slice(matchEnd[0].length)
}

End tag processing is relatively simple: just make sure the tag name in the end tag corresponds to the one in the start tag.

Conclusion
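The name comparison can be isolated in a small check. This is a sketch under assumptions: closesTag is a hypothetical helper, and endPattern is a simplified ASCII-only stand-in that keeps the same shape as the end tag regex listed earlier, with [^>]* inside the capture group.

```javascript
// Simplified ASCII-only stand-in for the end tag pattern (assumption).
const endPattern = /^<\/([a-zA-Z_][\-\.0-9_a-zA-Z]*[^>]*)>/;

// Hypothetical helper: does the front of `input` close the tag `openTagName`?
function closesTag(input, openTagName) {
  const matchEnd = input.match(endPattern);
  return matchEnd !== null && matchEnd[1] === openTagName;
}

console.log(closesTag('</div>', 'div'));  // true
console.log(closesTag('</span>', 'div')); // false
console.log(closesTag('</div >', 'div')); // false
```

The last case is worth noting: because [^>]* sits inside the capture group, an end tag written as </div > captures the name with its trailing space, so a strict comparison rejects it. Trimming the captured name before comparing would be one way to tolerate such input.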

This article briefly walks through the preparation and implementation of parsing HTML strings. At this point we can parse HTML strings that have no attributes and no child elements. Many shortcomings remain, and follow-up articles will refine the parsing process until it produces a complete AST like the one built by the Vue compiler.

Finally, a link to the code flowchart is attached