The rules of the regular expression is not much, 1 hour enough to read and understand, the hard part is how to apply those rules string together, my view is that there is no shortcut, only hand cooked, the law of ten thousand hours you should have heard, but in fact if just want to master the ability of a very niche and just want to reach the first-class level, 100 hours is enough
This article does not talk about the specific regular rules, but from the practical point of view, using the regular to achieve an HTML parser, for a given HTML legal string fragment, it can parse the tag fragment tag, nodeType, attributes (attributes), and will continue to parse the child nodes. Finally form a structure tree, the effect is as follows:
const str = '
'
const ast = genHTML(str)
console.log(ast)
Copy the code
Output:
[{"nodeType": 1."tag": "div"."attributes": { "class": "container"."name": "container-name" },
"children": [{"nodeType": 1."tag": "input"."attributes": { "type": "file" },
"children": []}, {"nodeType": 1."tag": "button"."attributes": { "class": "btn" },
"children": [{ "nodeType": 3."content": "Select file"}]}]Copy the code
Simple label data extraction
Agree on the parsed data structure
const nodeTypes = {
// Label node
element: 1.// Text node/blank node
text: 3.// Comment the node
comment: 8,}as const
type TAstNode = {
nodeType: typeof nodeTypes[keyof typeofnodeTypes] tag? :stringattributes? : { [key:string] :string} content? :stringchildren? : TAstNode[] }Copy the code
To simplify the problem, first assume that provided by the HTML fragment is label fragment is no child node, so the data that it contains only the name tags and attributes, the detailed method about how to extract the simple pieces, in an article before I’ve said, not much, only mention the difference and do simple series
For
, we can match the tag name and attribute according to the /<(\w+)\s*([^>]*)\s*>/. There are also non-self-closing tags: IMG, INPUT, BR, HR, meta, link, etc
, , , and are all valid. For /, the canonical value is (\/)? , \ is to escape /, the last? /^<(\w+)\s*([^>]*?) \s*(\/)? >/
const mt = s.match(/^<(\w+)\s*([^>]*?) \s*(\/)? >/)
// Tag name, such as div
const tag = mt[1]
// Attribute string, such as class="content" name="content-name"
const attributes = mt[2]
Copy the code
The next step is to perform the attributes processing to obtain the key-value pair of the attribute, which is re: /([^\s=]+)(=([“‘])(.*?). \ 3)? /
So we can get this method for handling tags
function genSingleStr(s: string) {
const mt = s.match(/^<(\w+)\s*([^>]*?) \s*(\/)? >/)
const obj = {
nodeType: nodeTypes.element,
tag: mt[1].attributes: {},
children: []}as TAstNode
const attributes = mt[2]
if (attributes) {
const mt1 = attributes.match(/([^\s=]+)(=(["'])(.*?) \ 3)? /g)
if (mt1) {
mt1.forEach(p= > {
const kv = p.trim().split('=')
obj.attributes[kv[0]] = kv[1].slice(1, -1)}}}return {
data: obj,
matchStr: mt[0]}}Copy the code
For the
fragment, run the method to get the result
{
"nodeType": 1."tag": "div"."attributes": {"class": "content"."name": "content-name"},
"children": []}Copy the code
For < img SRC = “https://example.com/1.png” > return:
{
"nodeType": 1."tag": "img"."attributes": {"src": "https://example.com/1.png"},
"children": []}Copy the code
Child node processing
The node is not self-closing
If the node doesn’t have child elements that’s easy, it just involves parsing the label, but obviously in most scenarios nodes have child nodes, so you need to continue processing the child nodes, while ensuring that the data structure reflects the parent-child relationship
For a node in an HTML fragment, how do you determine that the next node is a child node and not a sibling node? In fact, as long as the node has not yet encountered its end tag, all nodes between the start tag and the end tag are its children, otherwise it is its brothers
The focus then becomes how to determine the start tag and end tag of the node, that is, the range of fragments that the node includes
The opening tag and the closing tag must exist at the same time (otherwise it would not be a valid fragment), which is somewhat similar to the parenthesis pairing problem, where a close bracket must correspond to an open bracket
The parent node is only a start tag and an end tag, but because of its possible child node, its children is also has its own start and end tags, you can maintain an array of stack, meet start tag into the stack, ending tag out of the stack, the start and end tags could be the parent node may also be the parent node, However, if the stack is empty during loading and unloading, the parent node’s end label has been found
The first element of the stack is the parent’s start tag, and the last tag before the stack is emptied is the parent’s end tag
For example, for the following fragment:
<div>
<p><span></span></p>
</div>
<div></div>
Copy the code
Search for the first position of < and
function genHTML(s: string) {
const stack = []
const end = s.indexOf('< /')
const start = s.indexOf('<')
if (end <= start) {
// The end tag is matched first
} else {
// The start tag is matched first}}Copy the code
If the start tag is matched, the data associated with the tag should be pushed:
const beginRoot = genSingleStr(s)
stack.push(beginRoot.data)
Copy the code
If the stack is empty, it indicates that the label is the top-level parent node. If it is not, it indicates that the label is a child node, and the parent node is the last element in the stack. In this case, the label needs to be assigned to the children attribute of the parent node to maintain the parent-child relationship
const beginRoot = genSingleStr(s)
if(stack.length ! = =0) {
stack[stack.length - 1].children.push(beginRoot.data)
}
stack.push(beginRoot.data)
Copy the code
It then proceeds to parse the string fragment, cutting off the parsed string and leaving only the unparsed string, using a while loop to iterate over the unparsed fragment
function genHTML(s: string) {
while (s) {
const stack = []
const end = s.indexOf('< /')
const start = s.indexOf('<')
if (end <= start) {
// The end tag is matched first
} else {
// The start tag is matched first
const beginRoot = genSingleStr(s)
if(stack.length ! = =0) {
stack[stack.length - 1].children.push(beginRoot.data)
}
stack.push(beginRoot.data)
s = s.slice(beginRoot.matchStr.length))
}
}
}
Copy the code
If it matches the end tag, it must be the same as the last tag in the stack, otherwise it’s an invalid fragment, so you can check it here
= /<\/(\w+)>/ = = = /<\/(\w+)[^>]*>/
if (end <= start) {
const mtEnd = s.match(/<\/(\w+)[^>]*>/)
if(! mtEnd) {console.log('Failed to match end tag:', s.slice(0.20))
return null
}
const tag = mtEnd[1]
if(tag ! == stack[stack.length -1].tag) {
console.log(The 'tag does not match,${tag}= >${stack[stack.length - 1].tag}`)
return null}}Copy the code
After the success of the start and end tags match, should put the last item in the stack is also on behalf of the end of the match tag data out of the stack, and intercepting segment summary not remaining string matching to parse, and in every match to end tag to check whether the stack is empty, if the stack is empty has been matched to a full range of the parent node, Loop should exit
if (end <= start) {
// ...
stack.pop()
s = s.slice(mtEnd[0].length)
if (stack.length === 0) {
break}}Copy the code
Special nodes
There is another problem here. We assume that string fragments are normal tag nodes like
, but legal nodes also include text nodes, blank runes, comment nodes, and self-closing tag nodes, all of which need to be considered
These nodes are special because they have no child nodes and no corresponding closing tags, so they need to be handled individually
For text nodes, anything that does not begin with a < string and does not contain a < is a text node (in fact, text nodes can also contain <, but for simplicity’s sake, I’ll leave it at that) :
// Matches the literal node/blank node before the label
function matchTextEmpty(s: string) {
return s.match(/ ^ ^ <] + (? = <) /)}Copy the code
Zero-width positive predictive predictive predictors are used here, (? =<) represents the position before matching <
For blank nodes:
// Matches the blank runt before the label
function matchEmpty(s: string) {
return s.match(/^\s+/)}Copy the code
For a comment node, the comment node must start with
// Matches the comment tag
function matchComment(s: string) {
return s.match(/ ^ <! * - [^ >] -- > /)}Copy the code
For self-closing tag nodes, there are only those self-closing tags, which can be listed directly:
// Matches the self-closing label
function matchAutoCloseTag(s: string) {
return s.match(/^<(input|img|br|hr|meta|link)[^>]*>/)}Copy the code
Encapsulate the matching logic for these particular nodes
function manageSingleChild(s: string) {
// Text node/blank node
let mt = matchTextEmpty(s)
if (mt) {
return {
str: s.slice(mt[0].length),
node: { nodeType: nodeTypes.text, content: mt[0]}}}// Empty node point
mt = matchEmpty(s)
if (mt) {
return {
str: s.slice(mt[0].length),
node: { nodeType: nodeTypes.text, content: mt[0]}}}// Self-closing label
mt = matchAutoCloseTag(s)
if (mt) {
return {
str: s.slice(mt[0].length),
node: genSingleStr(s)
}
}
// Comment the tag
mt = matchComment(s)
if (mt) {
return {
str: s.slice(mt[0].length),
node: { nodeType: nodeTypes.comment, content: mt[0]}}}return null
}
Copy the code
Before parsing the fragment, check to see if the fragment starts with one of these special nodes. If so, there is no need to go through the following child node parsing:
while (s.length) {
const singleChildNode = manageSingleChild(s)
if (singleChildNode) {
stack[stack.length - 1].children.push(singleChildNode.node)
s = singleChildNode.str
continue
}
const end = s.indexOf('< /')
const start = s.indexOf('<')
// ...
}
Copy the code
Brother nodes
The next problem is that the top-level nodes have siblings, such as.container and.footer for the following HTML fragment
<div class="container"><p></p></div>
<div class="footer"></div>
Copy the code
Then we can add another loop, the outermost loop dealing with siblings, and the inner loop dealing with parent nodes. Similarly, sibling nodes can be non-self-closing tags, self-closing tags, text nodes, blank nodes, comment nodes, all of which need to be considered. This is similar to manageSingleChild
function genHTML(s: string) {
const root = [] as TAstNode[]
while (s.length) {
const singleNode = manageSingleRoot(s)
if (singleNode) {
root.push(singleNode.node)
s = singleNode.str
continue
}
const stack = []
while (s.length) {
// ...}}return root
}
Copy the code
At this point, the logic of the entire HTML method is complete
summary
Implemented in this paper is just a simple version of HTML parsing method, is not complete, complete code method is certainly not this quantity can solve, but this is not the focus of this article, this article mainly is to want to based on an actual combat scene, use regular expressions to solve practical problems, learn how to use is the key, after all, teach them to fish than teach them to fish
The full code is on Github
The last
Bytedance – Live Realisation and recruitment!
No less than 10 HC (cheat is a puppy), Beijing, Shanghai, Hangzhou can be, and is urgent recruit, just a programmer of that kind
Internship is accepted without limitation
It doesn’t matter if you have sent bytes to other departments before (failing the interview does not necessarily mean your ability is not good, it may be that your eye edge is not enough), you can continue to meet my department (in case you see the right eye, after all, our department is really short of people), you can send your resume to my email [email protected], all the people who send my resume, Make sure to follow up and give feedback on the progress of the interview, always answer questions (as long as it doesn’t violate company policy), and avoid the bad experience of having your resume tossed out of the window
Save the children! Come and share my needs!
Team to introduce
In charge of optimizing bytedance’s live streaming ads and short video e-commerce ads in China, and in charge of platform construction, algorithm optimization, and implementation of advertising products and operation strategies of Giant Qianchuan. The primary goal is to further increase the commercial revenue in China by leveraging byte’s powerful algorithmic engineering capabilities and giving full play to the natural advantages of the live streaming genre and e-commerce closed-loop. The life service business relies on Douyin, Douyin Speed Edition and other platforms to promote the connection between users and local services; In the past year, the life service business has created a new video planting and transaction experience, enabling more users to discover offline good places through Douyin, and helping local merchants expand into new business fronts. We look forward to your joining us. We hope you will be the next one to serve millions of merchants, influence hundreds of millions of consumers and lead the marketing revolution!
Team advantage
- Technology: as the ultimate guide to business, even as a research and development, will also be able to contact a line of customers, through technical solutions to customer problems, technical solution related to recall in the advertising system, row, row, bid, sorting, and many other links, to an in-depth understanding of advertisements each link the internals of the system.
- Growth: Byte e-commerce GMV is still improving at a high speed. When meeting purchase demands, short video and live broadcast have subversive advantages, and there is a great space for business growth.
- Opportunities: The buying experience of byte e-commerce is more diversified, including commodities, video, live stream, talent, fan relationship, live interaction, etc. Compared with traditional shelving e-commerce, the scope is larger and the development opportunities for individuals are more.