threeparse

Above, the basic parsing and attribute parsing of HTML string are completed in two parts. However, only a single label can be resolved, and parent-child relationship cannot be resolved. This time, will begin to analyze the relationship between superiors and subordinates.

It takes a little preparation

Before you start, it is important to clarify your requirements so that you are more targeted.

qwweerrwerw
dasdad
root = {
  tagName: "div".attrList: [].children: [
    "qwweerrwerw",
    {
      tagName: "span".attrList: [].children: ["dasdad"].parent: root,
    },
  ],
};
Copy the code

As you can see, there are roughly two types of child elements:

  • Text: Contains interpolated text and plain text
  • AstNode elements

Because of the hierarchical relationship, and in the legal case, all tags are closed normally, the stack can be used to maintain the parent-child relationship. When a tag is closed successfully, the element at the top of the stack is the parent element of the current closed element. Instead, the element is one of the children of the current parent element.

For text nodes, there is no need to maintain the parent node element, just add it to the parent element’s child element array. However, parsing text elements can be a bit difficult. This is because detecting the beginning of a < is considered to be the beginning of a start/end tag, but it is also possible that the < character is present in a paragraph of text. Let’s start with text parsing.

Required regulars

// Start tag
const startTag =
  ^ < / ((? :[a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u207 0-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*\:)? [a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u2070 -\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*)/;

// End tag
const endTag =
  /^<\/([a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040 \u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*[^>]*)>/;
Copy the code

The code is as follows:

function parse(input){
    / /... somecode
    while(input){
        let textEnd = input.indexOf('<')
        / /... The code that parses the opening and closing tags in somecode's second article
        if(textEnd >= 0) {
            let tmp = input.slice(textEnd)
            // if dadad< dadada exists
            while(
                !startTag.test(tmp) && // Not the start tag! endTag.test(tmp)// Not the closing tag) {// 1 means to start the search from the next bit of <, which is the index to start the search
                const n = tmp.indexOf('<'.1)
                if(n < 0) return 
                textEnd += n
                tmp = tmp.slice(n) 
            }
            const text = input.slice(0, textEnd)
            currentPatent.children.push(text)
            input = input.slice(textEnd)
        }
    }
}
Copy the code

This code is not complete code, but it is complete code that parses out a little bit of text. The key point is that when dealing with <, what you do here is to parse out as much of the text as possible that is not the opening and closing tags. The idea here is that if you want to use a < in text, you must add some non-label characters after the <, such as: <, <@, etc., so that the text will be parsed, otherwise it will be parsed as illegal start tags and discarded

Add parent-child relationship parsing

The maintenance of parent-child relationship needs the cooperation of stack, here will be the complete code to post out and then slowly parse:

function parse(input) {
    let root = null // Used to save the ast node parsed
    let ele = null // The element currently being parsed
    let tagName = ' ' // Name of the label currently being parsed
    let stack = [] // A stack used to maintain parent-child relationships
    // Is the string iterated anyway
    while(input) {
        let textEnd = input.indexOf('<')
        if(textEnd === 0) {// < may be a start tag, an end tag, or just a <

            // First try to match the start tag
            const match = input.match(startTag)
            if(match){
                // The description is the start label
                input = input.slice(match[0].length)
                // Check whether the label is closed properly
                let closeStart = null
                let attr = null
                let matchNode = {
                    tagName: match[1].attrList: []}while(! (closeStart = input.match(startTagClose)) && (attr = input.match(dynamicArgAttribute) || input.match(attribute))){// Collect attributes
                    matchNode.attrList.push({
                        name: attr[1].value: attr[3] || attr[4] || attr[5]
                    })
                    input = input.slice(attr[0].length)
                }
                if(closeStart){
                    input = input.slice(closeStart[0].length)
                    // The start tag ends, creating the ast HOME message of the nodeele = { ... matchNode,parent: null.children: []}// If there is no trailing node, the current element is the trailing node
                    if(! root){ root = ele } stack.push(ele)// The element is pushed in preparation for parsing its root element
                    if(closeStart[1= = ='/') {// Indicates a self-closing label
                        stack.pop() // Label end, out of the stack
                        if(stack.length){
                             // Sets the parent relationship of the current element
                            ele.parent = stack[stack.length - 1]
                            stack[stack.length - 1].children.push(ele)
                        }
                    } else {
                        tagName = ele.tagName
                    }
                    continue; }}const matchEnd = input.match(endTag)
            if(matchEnd){
                console.log(matchEnd);
                // The end tag is matched
                if(matchEnd[1] !== tagName){
                    // If the end and start labels are not matched, the labels are not valid and cannot be saved
                    root = null
                    break
                }
                stack.pop() // Label end, out of the stack
                if(stack.length){
                    // Sets the parent relationship of the current element
                    ele.parent = stack[stack.length - 1]
                    stack[stack.length - 1].children.push(ele)
                    // Reset tagName to the parent node's tagName
                    tagName = stack[stack.length - 1].tagName
                }
                input = input.slice(matchEnd[0].length)
                continue; }}if(textEnd >= 0) {
            let tmp = input.slice(textEnd)
            // if dadad< dadada exists
            while(
                !startTag.test(tmp) && // Not the start tag! endTag.test(tmp)// Not the closing tag) {// 1 means to start the search from the next bit of <, which is the index to start the search
                const n = tmp.indexOf('<'.1)
                if(n < 0) return 
                textEnd += n
                tmp = tmp.slice(n) 
            }
            const text = input.slice(0, textEnd)
            stack[stack.length - 1].children.push(text)
            input = input.slice(textEnd)
        }
    }
    return root
}

console.log('parse', parse('<div id="app" :b="c" v-html="d" :[xxx] = "e">dsdasdasd< dasdasda<span>dasdasdsada</span></div>'));
Copy the code

A new stack is added here to maintain the hierarchy. The lower the node, the closer it is to the bottom of the stack, the element currently parsed is at the bottom of the stack, so when the node currently parsed is finished, it will go off the stack. In this case, the element at the bottom of the stack is the parent element of the current ending element. Using this principle, the parent-child relationship can be smoothly maintained at the end of each element.

dsdasdasdada

However, the above code does not have a compatibility strategy for exception code. Of course, the compatibility for this exception case is to make the incomplete tag into a closed tag, so that the first span above will contain all the children of the div. The final result is

dsdasdasdada

.

The code for the final version is as follows:

// Start tag
const startTag =
  ^ < / ((? :[a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u207 0-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*\:)? [a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u2070 -\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*)/;

// Tag attributes
const attribute =
  /^\s*([^\s"'<>\/=]+)(? :\s*(=)\s*(? :"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))? /;

// Parse dynamic properties
const dynamicArgAttribute = /^\s*((? :v-[\w-]+:|@|:|#)\[[^=]+?\][^\s"'<>\/=]*)(? :\s*(=)\s*(? :"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))? /

// Start tag ends
const startTagClose = /^\s*(\/?) >/;

// End tag
const endTag = /^<\/([a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040 \u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*[^>]*)>/


function parse(input) {
    let root = null // Used to save the ast node parsed
    let ele = null // The element currently being parsed
    let tagName = ' ' // Name of the label currently being parsed
    let stack = [] // A stack used to maintain parent-child relationships
    // Is the string iterated anyway
    while(input.trim()) {
        let textEnd = input.indexOf('<')
        if(textEnd === 0) {// < may be a start tag, an end tag, or just a <

            // First try to match the start tag
            const match = input.match(startTag)
            if(match){
                // The description is the start label
                input = input.slice(match[0].length)
                // Check whether the label is closed properly
                let closeStart = null
                let attr = null
                let matchNode = {
                    tagName: match[1].attrList: []}while(! (closeStart = input.match(startTagClose)) && (attr = input.match(dynamicArgAttribute) || input.match(attribute))){// Collect attributes
                    matchNode.attrList.push({
                        name: attr[1].value: attr[3] || attr[4] || attr[5]
                    })
                    input = input.slice(attr[0].length)
                }
                if(closeStart){
                    input = input.slice(closeStart[0].length)
                    // The start tag ends, creating the ast HOME message of the nodeele = { ... matchNode,parent: null.children: []}// If there is no trailing node, the current element is the trailing node
                    if(! root){ root = ele } stack.push(ele)// The element is pushed in preparation for parsing its root element
                    if(closeStart[1= = ='/') {// Indicates a self-closing label
                        stack.pop() // Label end, out of the stack
                        if(stack.length){
                             // Sets the parent relationship of the current element
                            ele.parent = stack[stack.length - 1]
                            stack[stack.length - 1].children.push(ele)
                        }
                    } else {
                        tagName = ele.tagName
                    }
                    continue; }}const matchEnd = input.match(endTag)
            if(matchEnd){
                // console.log(matchEnd);
                // The end tag is matched
                if(matchEnd[1] !== tagName){
                    console.log('tagName',tagName, matchEnd[1]);
                    // If the end and start labels are not matched, the labels are not valid and cannot be saved
                    let pos = 0
                    for (pos = stack.length - 1; pos >=0 ; pos--) {
                        if(stack[pos].tagName === tagName) {
                            break}}if(pos>=0) {
                        for (let i = stack.length - 1; i >=pos ; i--) {
                            // Sets the parent relationship of the current element
                            stack[pos].parent = stack[i - 1]
                            stack[i - 1].children.push(stack[pos])
                        }
                    }
                    stack.length = pos
                    tagName = stack[pos - 1].tagName
                } else {
                    stack.pop() // Label end, out of the stack
                    if(stack.length){
                        // Sets the parent relationship of the current element
                        ele.parent = stack[stack.length - 1]
                        stack[stack.length - 1].children.push(ele)
                        // Reset tagName to the parent node's tagName
                        tagName = stack[stack.length - 1].tagName
                    }
                    input = input.slice(matchEnd[0].length)
                    continue; }}}if(textEnd >= 0) {
            let tmp = input.slice(textEnd)
            // if dadad< dadada exists
            while(
                !startTag.test(tmp) && // Not the start tag! endTag.test(tmp)// Not the closing tag) {// 1 means to start the search from the next bit of <, which is the index to start the search
                const n = tmp.indexOf('<'.1)
                if(n < 0) return 
                textEnd += n
                tmp = tmp.slice(n) 
            }
            const text = input.slice(0, textEnd)
            ele && ele.children.push(text)
            input = input.slice(textEnd)
        }
    }
    return root
}
let a = null
console.log('parse', a = parse(`
<div id="app" :b="c" v-html="d" :[xxx] = "e">
<span :s="ss">d
sdasdasddasdasda
<span>dasdasdsada</span>
<div><b>323213123</b></div>
</div>
`));
Copy the code

conclusion

At this point, the basic analysis of the code has been completed, but there are many deficiencies and explanations are not detailed, the subsequent article will continue to optimize.