No concept! Regular expression combat hand - in - hand implementation of an HTML parsing function

The rules of the regular expression is not much, 1 hour enough to read and understand, the hard part is how to apply those rules string together, my view is that there is no shortcut, only hand cooked, the law of ten thousand hours you should have heard, but in fact if just want to master the ability of a very niche and just want to reach the first-class level, 100 hours is enough

This article does not talk about the specific regular rules, but from the practical point of view, using the regular to achieve an HTML parser, for a given HTML legal string fragment, it can parse the tag fragment tag, nodeType, attributes (attributes), and will continue to parse the child nodes. Finally form a structure tree, the effect is as follows:

const str = '
      
       
       
      
'
const ast = genHTML(str)
console.log(ast)
Copy the code

Output:

[{"nodeType": 1."tag": "div"."attributes": { "class": "container"."name": "container-name" },
    "children": [{"nodeType": 1."tag": "input"."attributes": { "type": "file" },
        "children": []}, {"nodeType": 1."tag": "button"."attributes": { "class": "btn" },
        "children": [{ "nodeType": 3."content": "Select file"}]}]Copy the code

Simple label data extraction

Agree on the parsed data structure

const nodeTypes = {
  // Label node
  element: 1.// Text node/blank node
  text: 3.// Comment the node
  comment: 8,}as const

type TAstNode = {
  nodeType: typeof nodeTypes[keyof typeofnodeTypes] tag? :stringattributes? : { [key:string] :string} content? :stringchildren? : TAstNode[] }Copy the code

To simplify the problem, first assume that provided by the HTML fragment is label fragment is no child node, so the data that it contains only the name tags and attributes, the detailed method about how to extract the simple pieces, in an article before I’ve said, not much, only mention the difference and do simple series

For

, we can match the tag name and attribute according to the /<(\w+)\s*([^>]*)\s*>/. There are also non-self-closing tags: IMG, INPUT, BR, HR, meta, link, etc

, , , and are all valid. For /, the canonical value is (\/)? , \ is to escape /, the last? /^<(\w+)\s*([^>]*?) \s*(\/)? >/

const mt = s.match(/^<(\w+)\s*([^>]*?) \s*(\/)? >/)
// Tag name, such as div
const tag = mt[1]
// Attribute string, such as class="content" name="content-name"
const attributes = mt[2]
Copy the code

The next step is to perform the attributes processing to obtain the key-value pair of the attribute, which is re: /([^\s=]+)(=([“‘])(.*?). \ 3)? /

So we can get this method for handling tags

function genSingleStr(s: string) {
  const mt = s.match(/^<(\w+)\s*([^>]*?) \s*(\/)? >/)
  const obj = {
    nodeType: nodeTypes.element,
    tag: mt[1].attributes: {},
    children: []}as TAstNode
  const attributes = mt[2]
  if (attributes) {
    const mt1 = attributes.match(/([^\s=]+)(=(["'])(.*?) \ 3)? /g)
    if (mt1) {
      mt1.forEach(p= > {
        const kv = p.trim().split('=')
        obj.attributes[kv[0]] = kv[1].slice(1, -1)}}}return {
    data: obj,
    matchStr: mt[0]}}Copy the code

For the

fragment, run the method to get the result

{
  "nodeType": 1."tag": "div"."attributes": {"class": "content"."name": "content-name"},
  "children": []}Copy the code

For < img SRC = “https://example.com/1.png” > return:

{
  "nodeType": 1."tag": "img"."attributes": {"src": "https://example.com/1.png"},
  "children": []}Copy the code

Child node processing

The node is not self-closing

If the node doesn’t have child elements that’s easy, it just involves parsing the label, but obviously in most scenarios nodes have child nodes, so you need to continue processing the child nodes, while ensuring that the data structure reflects the parent-child relationship

For a node in an HTML fragment, how do you determine that the next node is a child node and not a sibling node? In fact, as long as the node has not yet encountered its end tag, all nodes between the start tag and the end tag are its children, otherwise it is its brothers

The focus then becomes how to determine the start tag and end tag of the node, that is, the range of fragments that the node includes

The opening tag and the closing tag must exist at the same time (otherwise it would not be a valid fragment), which is somewhat similar to the parenthesis pairing problem, where a close bracket must correspond to an open bracket

The parent node is only a start tag and an end tag, but because of its possible child node, its children is also has its own start and end tags, you can maintain an array of stack, meet start tag into the stack, ending tag out of the stack, the start and end tags could be the parent node may also be the parent node, However, if the stack is empty during loading and unloading, the parent node’s end label has been found

The first element of the stack is the parent’s start tag, and the last tag before the stack is emptied is the parent’s end tag

For example, for the following fragment:

<div>
  <p><span></span></p>
</div>
<div></div>
Copy the code

Found is closing tags, then the last element of stack up [‘ div ‘, ‘p’], and then met < / p > found closing tags, then the last element of stack [‘ div ‘], following meet < / div >, discovery is closing tags, then the last element of stack [], If the stack is empty, a full node range has been read. If the stack is empty, it is the sibling of the node

Search for the first position of < and

function genHTML(s: string) {
  const stack = []
  const end = s.indexOf('< /')
  const start = s.indexOf('<')
  if (end <= start) {
    // The end tag is matched first

  } else {
    // The start tag is matched first}}Copy the code

If the start tag is matched, the data associated with the tag should be pushed:

const beginRoot = genSingleStr(s)
stack.push(beginRoot.data)
Copy the code

If the stack is empty, it indicates that the label is the top-level parent node. If it is not, it indicates that the label is a child node, and the parent node is the last element in the stack. In this case, the label needs to be assigned to the children attribute of the parent node to maintain the parent-child relationship

const beginRoot = genSingleStr(s)
if(stack.length ! = =0) {
  stack[stack.length - 1].children.push(beginRoot.data)
}
stack.push(beginRoot.data)
Copy the code

It then proceeds to parse the string fragment, cutting off the parsed string and leaving only the unparsed string, using a while loop to iterate over the unparsed fragment

function genHTML(s: string) {
  while (s) {
    const stack = []
    const end = s.indexOf('< /')
    const start = s.indexOf('<')
    if (end <= start) {
      // The end tag is matched first

    } else {
      // The start tag is matched first
      const beginRoot = genSingleStr(s)
      if(stack.length ! = =0) {
        stack[stack.length - 1].children.push(beginRoot.data)
      }
      stack.push(beginRoot.data)
      s = s.slice(beginRoot.matchStr.length))
    }
  }
}
Copy the code

If it matches the end tag, it must be the same as the last tag in the stack, otherwise it’s an invalid fragment, so you can check it here

= /<\/(\w+)>/ = = = /<\/(\w+)[^>]*>/

if (end <= start) {
  const mtEnd = s.match(/<\/(\w+)[^>]*>/)
  if(! mtEnd) {console.log('Failed to match end tag:', s.slice(0.20))
    return null
  }
  const tag = mtEnd[1]
  if(tag ! == stack[stack.length -1].tag) {
    console.log(The 'tag does not match,${tag}= >${stack[stack.length - 1].tag}`)
    return null}}Copy the code

After the success of the start and end tags match, should put the last item in the stack is also on behalf of the end of the match tag data out of the stack, and intercepting segment summary not remaining string matching to parse, and in every match to end tag to check whether the stack is empty, if the stack is empty has been matched to a full range of the parent node, Loop should exit

if (end <= start) {
  // ...
  stack.pop()
  s = s.slice(mtEnd[0].length)
  if (stack.length === 0) {
    break}}Copy the code

Special nodes

There is another problem here. We assume that string fragments are normal tag nodes like

, but legal nodes also include text nodes, blank runes, comment nodes, and self-closing tag nodes, all of which need to be considered

These nodes are special because they have no child nodes and no corresponding closing tags, so they need to be handled individually

For text nodes, anything that does not begin with a < string and does not contain a < is a text node (in fact, text nodes can also contain <, but for simplicity’s sake, I’ll leave it at that) :

// Matches the literal node/blank node before the label
function matchTextEmpty(s: string) {
  return s.match(/ ^ ^ <] + (? = <) /)}Copy the code

Zero-width positive predictive predictive predictors are used here, (? =<) represents the position before matching <

For blank nodes:

// Matches the blank runt before the label
function matchEmpty(s: string) {
  return s.match(/^\s+/)}Copy the code

For a comment node, the comment node must start with

// Matches the comment tag
function matchComment(s: string) {
  return s.match(/ ^ <! * - [^ >] -- > /)}Copy the code

For self-closing tag nodes, there are only those self-closing tags, which can be listed directly:

// Matches the self-closing label
function matchAutoCloseTag(s: string) {
  return s.match(/^<(input|img|br|hr|meta|link)[^>]*>/)}Copy the code

Encapsulate the matching logic for these particular nodes

function manageSingleChild(s: string) {
  // Text node/blank node
  let mt = matchTextEmpty(s)
  if (mt) {
    return {
      str: s.slice(mt[0].length),
      node: { nodeType: nodeTypes.text, content: mt[0]}}}// Empty node point
  mt = matchEmpty(s)
  if (mt) {
    return {
      str: s.slice(mt[0].length),
      node: { nodeType: nodeTypes.text, content: mt[0]}}}// Self-closing label
  mt = matchAutoCloseTag(s)
  if (mt) {
    return {
      str: s.slice(mt[0].length),
      node: genSingleStr(s)
    }
  }
  // Comment the tag
  mt = matchComment(s)
  if (mt) {
    return {
      str: s.slice(mt[0].length),
      node: { nodeType: nodeTypes.comment, content: mt[0]}}}return null
}
Copy the code

Before parsing the fragment, check to see if the fragment starts with one of these special nodes. If so, there is no need to go through the following child node parsing:

while (s.length) {
  const singleChildNode = manageSingleChild(s)
  if (singleChildNode) {
    stack[stack.length - 1].children.push(singleChildNode.node)
    s = singleChildNode.str
    continue
  }
  const end = s.indexOf('< /')
  const start = s.indexOf('<')
  // ...
}
Copy the code

Brother nodes

The next problem is that the top-level nodes have siblings, such as.container and.footer for the following HTML fragment

<div class="container"><p></p></div>
<div class="footer"></div>
Copy the code

Then we can add another loop, the outermost loop dealing with siblings, and the inner loop dealing with parent nodes. Similarly, sibling nodes can be non-self-closing tags, self-closing tags, text nodes, blank nodes, comment nodes, all of which need to be considered. This is similar to manageSingleChild

function genHTML(s: string) {
  const root = [] as TAstNode[]
  while (s.length) {
    const singleNode = manageSingleRoot(s)
    if (singleNode) {
      root.push(singleNode.node)
      s = singleNode.str
      continue
    }
    const stack = []
    while (s.length) {
      // ...}}return root
}
Copy the code

At this point, the logic of the entire HTML method is complete

summary

Implemented in this paper is just a simple version of HTML parsing method, is not complete, complete code method is certainly not this quantity can solve, but this is not the focus of this article, this article mainly is to want to based on an actual combat scene, use regular expressions to solve practical problems, learn how to use is the key, after all, teach them to fish than teach them to fish

The full code is on Github

The last

Bytedance – Live Realisation and recruitment!

No less than 10 HC (cheat is a puppy), Beijing, Shanghai, Hangzhou can be, and is urgent recruit, just a programmer of that kind

Internship is accepted without limitation

It doesn’t matter if you have sent bytes to other departments before (failing the interview does not necessarily mean your ability is not good, it may be that your eye edge is not enough), you can continue to meet my department (in case you see the right eye, after all, our department is really short of people), you can send your resume to my email kother@foxmail.com, all the people who send my resume, Make sure to follow up and give feedback on the progress of the interview, always answer questions (as long as it doesn’t violate company policy), and avoid the bad experience of having your resume tossed out of the window

Save the children! Come and share my needs!

Team to introduce

In charge of optimizing bytedance’s live streaming ads and short video e-commerce ads in China, and in charge of platform construction, algorithm optimization, and implementation of advertising products and operation strategies of Giant Qianchuan. The primary goal is to further increase the commercial revenue in China by leveraging byte’s powerful algorithmic engineering capabilities and giving full play to the natural advantages of the live streaming genre and e-commerce closed-loop. The life service business relies on Douyin, Douyin Speed Edition and other platforms to promote the connection between users and local services; In the past year, the life service business has created a new video planting and transaction experience, enabling more users to discover offline good places through Douyin, and helping local merchants expand into new business fronts. We look forward to your joining us. We hope you will be the next one to serve millions of merchants, influence hundreds of millions of consumers and lead the marketing revolution!

Team advantage

Technology: as the ultimate guide to business, even as a research and development, will also be able to contact a line of customers, through technical solutions to customer problems, technical solution related to recall in the advertising system, row, row, bid, sorting, and many other links, to an in-depth understanding of advertisements each link the internals of the system.
Growth: Byte e-commerce GMV is still improving at a high speed. When meeting purchase demands, short video and live broadcast have subversive advantages, and there is a great space for business growth.
Opportunities: The buying experience of byte e-commerce is more diversified, including commodities, video, live stream, talent, fan relationship, live interaction, etc. Compared with traditional shelving e-commerce, the scope is larger and the development opportunities for individuals are more.

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

No concept! Regular expression combat hand – in – hand implementation of an HTML parsing function

Simple label data extraction

Child node processing

The node is not self-closing

Special nodes

Brother nodes

summary

The last

No concept! Regular expression combat hand – in – hand implementation of an HTML parsing function

Simple label data extraction

Child node processing

The node is not self-closing

Special nodes

Brother nodes

summary

The last

Related Posts

RUST is the future of JavaScript infrastructure

TS getting started – Generics

Koa source code analysis