Introduction

Today, while reading the part of the Vue source code that parses the HTML of an SFC (Single File Component), I came across a very long regular expression. It is located at /src/compiler/parser/html-parser.js:16:

const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/

It is mainly used in /src/compiler/parser/html-parser.js:189-209:

function parseStartTag () {
  const start = html.match(startTagOpen)
  if (start) {
    const match = {
      tagName: start[1],
      attrs: [],
      start: index
    }
    advance(start[0].length)
    let end, attr
    while (!(end = html.match(startTagClose)) && (attr = html.match(attribute))) {
      advance(attr[0].length)
      match.attrs.push(attr)
    }
    if (end) {
      match.unarySlash = end[1]
      advance(end[0].length)
      match.end = index
      return match
    }
  }
}

The purpose of this code is to match a start tag at the beginning of an HTML string and then collect all of that tag's attributes into an array. I suspect I am not the only one who would find it hard to write such a regular expression in a short time. So today I will take a detailed look at how this complex regular expression is put together, and at what it can and cannot match. Along the way I will cover some regular expression basics; experts, please be kind. The article is a little long, so be patient.
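To try the loop outside of Vue, the excerpt can be turned into a self-contained sketch. Note the assumptions: startTagOpen and startTagClose below are simplified stand-ins written for illustration (Vue's real patterns are more involved), and parseStartTagSketch is a hypothetical name, not a function from the Vue source. Only the attribute pattern is the one quoted above.

```javascript
// The attribute pattern quoted above.
const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/;
// Simplified stand-ins (assumptions), not Vue's real patterns.
const startTagOpen = /^<([a-zA-Z][\w-]*)/;
const startTagClose = /^\s*(\/?)>/;

// Hypothetical helper: same loop as the excerpt, but with its own state.
function parseStartTagSketch(html) {
  let index = 0;
  const advance = (n) => {
    index += n;
    html = html.substring(n);
  };

  const start = html.match(startTagOpen);
  if (!start) return undefined;

  const match = { tagName: start[1], attrs: [], start: index };
  advance(start[0].length);

  let end, attr;
  // Keep consuming attributes until the closing `>` or `/>` is reached.
  while (!(end = html.match(startTagClose)) && (attr = html.match(attribute))) {
    advance(attr[0].length);
    match.attrs.push(attr);
  }
  if (end) {
    match.unarySlash = end[1];
    advance(end[0].length);
    match.end = index;
    return match;
  }
  return undefined;
}

const tag = parseStartTagSketch('<input type="text" disabled/>');
console.log(tag.tagName, tag.attrs.map((a) => a[1]), tag.unarySlash);
```

For this input the sketch collects two attribute matches, type and disabled, and records the / that precedes > in unarySlash.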

Divide and conquer

const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/

At first glance this regular expression is very long, and readers who are not familiar with regular expressions may be intimidated or even skip it entirely. One way to deal with that is to "divide and conquer": break the long expression into shorter pieces and understand each one. The expression above can be split roughly as follows:

/^\s*   ([^\s"'<>\/=]+)   (?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/
 (1)          (2)                                  (3)

The first part

First, look at part (1), which is the simplest: ^ anchors the match at the start of the input, \s matches a whitespace character, and * means zero or more of the preceding item. So the meaning of the first part is clear: the input may begin with zero, one, or several whitespace characters. If this piece is used as an expression on its own:

const part1 = /^\s*/;
'abc'.match(part1);   // Matches an empty string
' abc'.match(part1);  // Matches one space
'  abc'.match(part1); // Matches two spaces

The second part

Next, part (2): ([^\s"'<>\/=]+). First, it is wrapped in (). In regular expressions this is called a capturing group. What does that mean? "Capture" and "group": the text matched by this part is captured as a group. That is, when the whole expression matches, the substring matched by the parenthesized sub-expression is added to the result array as its own entry. For example:

const group = /a(.*)a/;
`a1232a`.match(group); // => ['a1232a', '1232']
// result[0] is the text matched by the whole expression; result[1] is the part of it matched by the sub-expression inside ()
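The full attribute expression also contains groups that open with (?:. These are non-capturing groups: they group a sub-expression without adding an entry to the result array. A minimal sketch of the difference, reusing the toy pattern from above (the variable names are mine, chosen for illustration):

```javascript
const capturing = /a(.*)a/;      // () captures what .* matched
const nonCapturing = /a(?:.*)a/; // (?:) only groups, nothing is captured

const withGroup = 'a1232a'.match(capturing);
const withoutGroup = 'a1232a'.match(nonCapturing);

console.log(withGroup.length);    // 2: the whole match plus one captured group
console.log(withoutGroup.length); // 1: only the whole match
```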

With that in mind, let's pop back up from the study of () and return to the second part of the expression.

Inside the () are a [] and a +. The [] denotes a character set, which restricts what a single character may be. A ^ appears as the first character inside the []. This ^ is completely different from the ^ we saw a moment ago: as the first character of a character set it means "not", so none of the characters listed in the set may appear. Which characters are excluded? They are \s, ", ', <, >, \/, and = (whitespace, double quote, single quote, less-than, greater-than, forward slash, and equals sign). Since only these are excluded, any other character is allowed. The trailing + means there must be at least one such character, and there may be more.

So now we know what this expression matches: a non-empty run of characters that contains no whitespace, double quote, single quote, less-than sign, greater-than sign, forward slash, or equals sign. For example:

const part2 = /([^\s"'<>\/=]+)/;
'name'.match(part2); // => ['name', 'name']
' name'.match(part2); // => ['name', 'name']. Why does this match? Because this regular expression has no leading '^' anchor.

So let’s put the first two parts together:

const part1_2 = /^\s*([^\s"'<>\/=]+)/;
'name="benchen"'.match(part1_2); // => ['name', 'name', index: 0, input: 'name="benchen"']
' +="benchen"'.match(part1_2); // => [' +', '+', index: 0, input: ' +="benchen"']
' ="benchen"'.match(part1_2); // => null
// Why? The leading space satisfies the first part, but it is followed by an equals sign.
// '=' is not allowed in the second part, so there is no match.

Part 3 (Hang on)

Now, the third part, which looks very complicated.

(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?

There is a ? at the end of part three. We have already met * and +, so let's summarize all three quantifiers:

  • ? means zero or one
  • + means one or more
  • * means zero or more

So, back to the expression: the trailing ? means that the third part as a whole is optional. The group may match, or it may be absent.
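The three quantifiers can be compared side by side. This is only a small sketch with made-up patterns (ab?, ab+, ab* and a toy name(=value)? pattern), not anything taken from the Vue source:

```javascript
const optional = /^ab?$/;   // 'a' followed by zero or one 'b'
const oneOrMore = /^ab+$/;  // 'a' followed by at least one 'b'
const zeroOrMore = /^ab*$/; // 'a' followed by any number of 'b', including none

const results = {};
for (const input of ['a', 'ab', 'abb']) {
  results[input] = [optional.test(input), oneOrMore.test(input), zeroOrMore.test(input)];
}
console.log(results);
// a:   [true,  false, true]
// ab:  [true,  true,  true]
// abb: [false, true,  true]

// ? can also follow a whole group, which makes the group optional.
const optionalGroup = /^name(=value)?$/;
const bare = 'name'.match(optionalGroup);            // matches, group is undefined
const withValue = 'name=value'.match(optionalGroup); // matches, group is '=value'
console.log(bare[1], withValue[1]);
```

The last pattern mirrors how the third part of the attribute expression behaves: when the optional group is absent, the match still succeeds and the group's slot in the result is undefined.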

Let’s break down the third group using the divide-and-conquer method:

(?: \s*(=)\s* (?: "([^"]*)"+ | '([^']*)'+ | ([^\s"'=<>`]+) ))?
       (1)           (2)          (3)            (4)

Part 1: optional whitespace, then a required equals sign, then optional whitespace.

Part 2: a pair of double quotes enclosing any number of characters that are not double quotes. So "ABC" matches, while a value that itself contains a double quote does not match in full: for "A"BC" only "A" is matched. Because of the + after the closing quote, repeated closing quotes are also accepted, so """ matches with an empty value.

Part 3: the same as part 2, with single quotes instead of double quotes.

Part 4: a non-empty string that contains no whitespace, double quotes, single quotes, equals signs, less-than signs, greater-than signs, or backticks (`).

Note: parts 2, 3, and 4 are alternatives joined by |; matching any one of them is enough.
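The three value alternatives can be tried in isolation. In this sketch the leading ^ is my addition so that the match must start at the first character, and whichAlternative is a hypothetical helper name:

```javascript
// Parts 2, 3 and 4 on their own, anchored at the start for the demo.
const value = /^(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+))/;

// Hypothetical helper: report which alternative matched and what it captured.
function whichAlternative(input) {
  const m = input.match(value);
  if (!m) return null;
  if (m[1] !== undefined) return { kind: 'double-quoted', value: m[1] };
  if (m[2] !== undefined) return { kind: 'single-quoted', value: m[2] };
  return { kind: 'unquoted', value: m[3] };
}

console.log(whichAlternative('"abc"')); // double-quoted, value 'abc'
console.log(whichAlternative("'abc'")); // single-quoted, value 'abc'
console.log(whichAlternative('abc'));   // unquoted, value 'abc'
console.log(whichAlternative('"a"b"')); // double-quoted, value 'a' (stops at the second quote)
console.log(whichAlternative('=abc'));  // null, '=' cannot start any alternative
```

Only one of the three capture groups is ever filled for a given input; the other two stay undefined.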

Integration

It's finally time to put it all together and see what the expression picks up.

/^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/

The input may start with any amount of whitespace, followed by a non-empty name that contains no whitespace, double quotes ("), single quotes ('), less-than signs (<), greater-than signs (>), slashes (/), or equals signs (=). The name may or may not be followed by the third group. If the third group is present, it must satisfy the following: optional whitespace, an equals sign, optional whitespace, and then a value. The value is either a string wrapped in double quotes that contains no double quote, a string wrapped in single quotes that contains no single quote, or a non-empty unquoted string that contains no whitespace, double quotes ("), single quotes ('), equals signs (=), less-than signs (<), greater-than signs (>), or backticks (`).

I give up. I admit that natural language is nowhere near as expressive as a regular expression. Let's look at some examples:

const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/

// There are five capture groups, so the result array should have six values.
// For convenience, the index and input properties are omitted from the results below.

'name="benchen"'.match(attribute); // The simplest case
//=> ['name="benchen"', 'name', '=', 'benchen', undefined, undefined]
' name="benchen"'.match(attribute); // Preceded by a space
//=> [' name="benchen"', 'name', '=', 'benchen', undefined, undefined]
' name = "benchen"'.match(attribute); // Spaces before and after the equals sign
//=> [' name = "benchen"', 'name', '=', 'benchen', undefined, undefined]
` name = 'haha'`.match(attribute); // Value wrapped in single quotes
//=> [" name = 'haha'", 'name', '=', undefined, 'haha', undefined]
` name = haha`.match(attribute); // Unquoted value
//=> [' name = haha', 'name', '=', undefined, undefined, 'haha']
'name'.match(attribute); // Attribute name only, no value
//=> ['name', 'name', undefined, undefined, undefined, undefined]
'+ = +'.match(attribute); // A weird one
//=> ['+ = +', '+', '=', undefined, undefined, '+']
'@click="clickHandler"'.match(attribute); // Vue event binding
//=> ['@click="clickHandler"', '@click', '=', 'clickHandler', undefined, undefined]
':name="name"'.match(attribute); // Vue data binding (v-bind shorthand)
//=> [':name="name"', ':name', '=', 'name', undefined, undefined]
'v-model="model"'.match(attribute); // Vue two-way binding
//=> ['v-model="model"', 'v-model', '=', 'model', undefined, undefined]
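Since only one of the three value groups is filled for any match, a caller usually wants to collapse them into a single value. A minimal sketch of that idea follows; toNameValue is a hypothetical helper of my own, not a function from the Vue source:

```javascript
const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/;

// Hypothetical helper: reduce the six-entry match to { name, value }.
function toNameValue(input) {
  const m = input.match(attribute);
  if (!m) return null;
  // Groups 3, 4 and 5 are the double-quoted, single-quoted and unquoted value.
  const value = [m[3], m[4], m[5]].find((v) => v !== undefined);
  // A name without "=value" (for example "disabled") gets a null value here.
  return { name: m[1], value: value === undefined ? null : value };
}

console.log(toNameValue('name="benchen"'));  // { name: 'name', value: 'benchen' }
console.log(toNameValue(" name = 'haha'"));  // { name: 'name', value: 'haha' }
console.log(toNameValue('v-model = model')); // { name: 'v-model', value: 'model' }
console.log(toNameValue('disabled'));        // { name: 'disabled', value: null }
```

Returning null for a missing value is just one possible design choice; it keeps an attribute without a value distinguishable from one written as name="".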

Input that does not match

'="benchen"'.match(attribute); // null: the leading '=' cannot match part 2

Input that arguably should not match

'name=="benchen"'.match(attribute);
//=> ['name', 'name', undefined, undefined, undefined, undefined]
// In my opinion the input above should not match at all. This is probably a limitation of the regular expression rather than a bug.
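When the expression is used on a single attribute string in isolation, one way to catch such partial matches is to require that the match consumes the whole input. This is only a sketch with a hypothetical helper name (matchWholeAttribute); it would not suit the parser loop shown earlier, where the input is the rest of the HTML and the match is expected to stop early:

```javascript
const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/;

// Hypothetical helper: accept a match only if nothing is left over.
function matchWholeAttribute(input) {
  const m = input.match(attribute);
  return m !== null && m[0].length === input.length ? m : null;
}

console.log(matchWholeAttribute('name="benchen"') !== null); // true
console.log(matchWholeAttribute('name=="benchen"'));         // null, only "name" was consumed
```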

Conclusion

No matter how complex a regular expression is, it is composed of smaller groups. Analyzing or designing it group by group reduces the effort needed to understand it.
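The same idea works in the design direction. As a sketch, the attribute expression can be assembled from named pieces with RegExp's source property; the piece names below are my own, and this is not how the Vue source defines the pattern:

```javascript
// Each piece is a small, readable regular expression.
const leading = /^\s*/;
const name = /([^\s"'<>\/=]+)/;
const doubleQuoted = /"([^"]*)"+/;
const singleQuoted = /'([^']*)'+/;
const unquoted = /([^\s"'=<>`]+)/;

// The three value alternatives, wrapped in a non-capturing group.
const value = `(?:${doubleQuoted.source}|${singleQuoted.source}|${unquoted.source})`;

// name, optionally followed by "= value".
const attribute = new RegExp(
  `${leading.source}${name.source}(?:\\s*(=)\\s*${value})?`
);

const m = ' name = "benchen"'.match(attribute);
console.log(m[1], m[2], m[3]); // name = benchen
```

The composed expression has the same five capture groups as the one-line version, in the same order.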
