threeparse
Above, the basic parsing and attribute parsing of HTML string are completed in two parts. However, only a single label can be resolved, and parent-child relationship cannot be resolved. This time, will begin to analyze the relationship between superiors and subordinates.
It takes a little preparation
Before you start, it is important to clarify your requirements so that you are more targeted.
dasdad
root = {
tagName: "div".attrList: [].children: [
"qwweerrwerw",
{
tagName: "span".attrList: [].children: ["dasdad"].parent: root,
},
],
};
Copy the code
As you can see, there are roughly two types of child elements:
- Text: Contains interpolated text and plain text
- AstNode elements
Because of the hierarchical relationship, and in the legal case, all tags are closed normally, the stack can be used to maintain the parent-child relationship. When a tag is closed successfully, the element at the top of the stack is the parent element of the current closed element. Instead, the element is one of the children of the current parent element.
For text nodes, there is no need to maintain the parent node element, just add it to the parent element’s child element array. However, parsing text elements can be a bit difficult. This is because detecting the beginning of a < is considered to be the beginning of a start/end tag, but it is also possible that the < character is present in a paragraph of text. Let’s start with text parsing.
Required regulars
// Start tag
const startTag =
^ < / ((? :[a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u207 0-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*\:)? [a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u2070 -\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*)/;
// End tag
const endTag =
/^<\/([a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040 \u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*[^>]*)>/;
Copy the code
The code is as follows:
function parse(input){
/ /... somecode
while(input){
let textEnd = input.indexOf('<')
/ /... The code that parses the opening and closing tags in somecode's second article
if(textEnd >= 0) {
let tmp = input.slice(textEnd)
// if dadad< dadada exists
while(
!startTag.test(tmp) && // Not the start tag! endTag.test(tmp)// Not the closing tag) {// 1 means to start the search from the next bit of <, which is the index to start the search
const n = tmp.indexOf('<'.1)
if(n < 0) return
textEnd += n
tmp = tmp.slice(n)
}
const text = input.slice(0, textEnd)
currentPatent.children.push(text)
input = input.slice(textEnd)
}
}
}
Copy the code
This code is not complete code, but it is complete code that parses out a little bit of text. The key point is that when dealing with <, what you do here is to parse out as much of the text as possible that is not the opening and closing tags. The idea here is that if you want to use a < in text, you must add some non-label characters after the <, such as: <, <@, etc., so that the text will be parsed, otherwise it will be parsed as illegal start tags and discarded
Add parent-child relationship parsing
The maintenance of parent-child relationship needs the cooperation of stack, here will be the complete code to post out and then slowly parse:
function parse(input) {
let root = null // Used to save the ast node parsed
let ele = null // The element currently being parsed
let tagName = ' ' // Name of the label currently being parsed
let stack = [] // A stack used to maintain parent-child relationships
// Is the string iterated anyway
while(input) {
let textEnd = input.indexOf('<')
if(textEnd === 0) {// < may be a start tag, an end tag, or just a <
// First try to match the start tag
const match = input.match(startTag)
if(match){
// The description is the start label
input = input.slice(match[0].length)
// Check whether the label is closed properly
let closeStart = null
let attr = null
let matchNode = {
tagName: match[1].attrList: []}while(! (closeStart = input.match(startTagClose)) && (attr = input.match(dynamicArgAttribute) || input.match(attribute))){// Collect attributes
matchNode.attrList.push({
name: attr[1].value: attr[3] || attr[4] || attr[5]
})
input = input.slice(attr[0].length)
}
if(closeStart){
input = input.slice(closeStart[0].length)
// The start tag ends, creating the ast HOME message of the nodeele = { ... matchNode,parent: null.children: []}// If there is no trailing node, the current element is the trailing node
if(! root){ root = ele } stack.push(ele)// The element is pushed in preparation for parsing its root element
if(closeStart[1= = ='/') {// Indicates a self-closing label
stack.pop() // Label end, out of the stack
if(stack.length){
// Sets the parent relationship of the current element
ele.parent = stack[stack.length - 1]
stack[stack.length - 1].children.push(ele)
}
} else {
tagName = ele.tagName
}
continue; }}const matchEnd = input.match(endTag)
if(matchEnd){
console.log(matchEnd);
// The end tag is matched
if(matchEnd[1] !== tagName){
// If the end and start labels are not matched, the labels are not valid and cannot be saved
root = null
break
}
stack.pop() // Label end, out of the stack
if(stack.length){
// Sets the parent relationship of the current element
ele.parent = stack[stack.length - 1]
stack[stack.length - 1].children.push(ele)
// Reset tagName to the parent node's tagName
tagName = stack[stack.length - 1].tagName
}
input = input.slice(matchEnd[0].length)
continue; }}if(textEnd >= 0) {
let tmp = input.slice(textEnd)
// if dadad< dadada exists
while(
!startTag.test(tmp) && // Not the start tag! endTag.test(tmp)// Not the closing tag) {// 1 means to start the search from the next bit of <, which is the index to start the search
const n = tmp.indexOf('<'.1)
if(n < 0) return
textEnd += n
tmp = tmp.slice(n)
}
const text = input.slice(0, textEnd)
stack[stack.length - 1].children.push(text)
input = input.slice(textEnd)
}
}
return root
}
console.log('parse', parse('<div id="app" :b="c" v-html="d" :[xxx] = "e">dsdasdasd< dasdasda<span>dasdasdsada</span></div>'));
Copy the code
A new stack is added here to maintain the hierarchy. The lower the node, the closer it is to the bottom of the stack, the element currently parsed is at the bottom of the stack, so when the node currently parsed is finished, it will go off the stack. In this case, the element at the bottom of the stack is the parent element of the current ending element. Using this principle, the parent-child relationship can be smoothly maintained at the end of each element.
However, the above code does not have a compatibility strategy for exception code. Of course, the compatibility for this exception case is to make the incomplete tag into a closed tag, so that the first span above will contain all the children of the div. The final result is
.
The code for the final version is as follows:
// Start tag
const startTag =
^ < / ((? :[a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u207 0-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*\:)? [a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u2070 -\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*)/;
// Tag attributes
const attribute =
/^\s*([^\s"'<>\/=]+)(? :\s*(=)\s*(? :"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))? /;
// Parse dynamic properties
const dynamicArgAttribute = /^\s*((? :v-[\w-]+:|@|:|#)\[[^=]+?\][^\s"'<>\/=]*)(? :\s*(=)\s*(? :"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))? /
// Start tag ends
const startTagClose = /^\s*(\/?) >/;
// End tag
const endTag = /^<\/([a-zA-Z_][\-\.0-9_a-zA-Za-zA-Z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F-\u2040 \u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]*[^>]*)>/
function parse(input) {
let root = null // Used to save the ast node parsed
let ele = null // The element currently being parsed
let tagName = ' ' // Name of the label currently being parsed
let stack = [] // A stack used to maintain parent-child relationships
// Is the string iterated anyway
while(input.trim()) {
let textEnd = input.indexOf('<')
if(textEnd === 0) {// < may be a start tag, an end tag, or just a <
// First try to match the start tag
const match = input.match(startTag)
if(match){
// The description is the start label
input = input.slice(match[0].length)
// Check whether the label is closed properly
let closeStart = null
let attr = null
let matchNode = {
tagName: match[1].attrList: []}while(! (closeStart = input.match(startTagClose)) && (attr = input.match(dynamicArgAttribute) || input.match(attribute))){// Collect attributes
matchNode.attrList.push({
name: attr[1].value: attr[3] || attr[4] || attr[5]
})
input = input.slice(attr[0].length)
}
if(closeStart){
input = input.slice(closeStart[0].length)
// The start tag ends, creating the ast HOME message of the nodeele = { ... matchNode,parent: null.children: []}// If there is no trailing node, the current element is the trailing node
if(! root){ root = ele } stack.push(ele)// The element is pushed in preparation for parsing its root element
if(closeStart[1= = ='/') {// Indicates a self-closing label
stack.pop() // Label end, out of the stack
if(stack.length){
// Sets the parent relationship of the current element
ele.parent = stack[stack.length - 1]
stack[stack.length - 1].children.push(ele)
}
} else {
tagName = ele.tagName
}
continue; }}const matchEnd = input.match(endTag)
if(matchEnd){
// console.log(matchEnd);
// The end tag is matched
if(matchEnd[1] !== tagName){
console.log('tagName',tagName, matchEnd[1]);
// If the end and start labels are not matched, the labels are not valid and cannot be saved
let pos = 0
for (pos = stack.length - 1; pos >=0 ; pos--) {
if(stack[pos].tagName === tagName) {
break}}if(pos>=0) {
for (let i = stack.length - 1; i >=pos ; i--) {
// Sets the parent relationship of the current element
stack[pos].parent = stack[i - 1]
stack[i - 1].children.push(stack[pos])
}
}
stack.length = pos
tagName = stack[pos - 1].tagName
} else {
stack.pop() // Label end, out of the stack
if(stack.length){
// Sets the parent relationship of the current element
ele.parent = stack[stack.length - 1]
stack[stack.length - 1].children.push(ele)
// Reset tagName to the parent node's tagName
tagName = stack[stack.length - 1].tagName
}
input = input.slice(matchEnd[0].length)
continue; }}}if(textEnd >= 0) {
let tmp = input.slice(textEnd)
// if dadad< dadada exists
while(
!startTag.test(tmp) && // Not the start tag! endTag.test(tmp)// Not the closing tag) {// 1 means to start the search from the next bit of <, which is the index to start the search
const n = tmp.indexOf('<'.1)
if(n < 0) return
textEnd += n
tmp = tmp.slice(n)
}
const text = input.slice(0, textEnd)
ele && ele.children.push(text)
input = input.slice(textEnd)
}
}
return root
}
let a = null
console.log('parse', a = parse(`
<div id="app" :b="c" v-html="d" :[xxx] = "e">
<span :s="ss">d
sdasdasddasdasda
<span>dasdasdsada</span>
<div><b>323213123</b></div>
</div>
`));
Copy the code
conclusion
At this point, the basic analysis of the code has been completed, but there are many deficiencies and explanations are not detailed, the subsequent article will continue to optimize.