Written on June 2, 2015. It may be out of date, so read it with that in mind.

The origin

There's a saying, and I don't know where it originally came from:

If you have a problem that you think you can solve with a regular expression, then you have two problems.

I find regular expressions to be a really hard language to read, even worse than XML, but they work really well. The trouble with regular expressions is that it's hard to know intuitively what a regex is going to do: when you write one, it's hard to visualize how the engine will process it, so unexpected problems often arise. Today we're going to talk about one that seriously affects performance, which I call the backtracking trap, better known as catastrophic backtracking.

Backtracking

Backtracking is not inherent to regular expressions; it depends on the specific regex engine. Simply put, regex engines are divided into DFA and NFA. The theory is hard to follow and boring, so I'll keep this brief. A DFA (deterministic finite automaton) is driven by the text: it scans from left to right and never examines a character twice, so its time complexity is polynomial and it is generally faster, but it supports few features: no capture groups, no backreferences, and so on. An NFA (non-deterministic finite automaton) is driven by the regular expression: it keeps reading characters and trying to match the current part of the pattern, and when a match fails it gives characters back and tries again. It is usually slower: the best-case time complexity is polynomial, the worst case is exponential. But an NFA supports far more features, so in most programming scenarios (including JS) we're dealing with an NFA.

The NFA matching process is: consume a character, try to match; if it succeeds, consume the next one and try again; if it fails, give the character back and return to the previous state. Because the same string may be matchable along different paths through the pattern, the engine then tries another path: if that succeeds it keeps consuming characters, otherwise it gives characters back and returns to an earlier state again. This process of returning to a previous state after a failed attempt is called backtracking. The performance of a regex match depends on the amount of backtracking: the more backtracking, the worse the performance.

To see this clearly, let's do an experiment: use the regex /a(acd|bsc|bcd)/ to match the string "abcd".

In the screenshot, the regular expression is at the top, the text to be matched is on the right, and the matching process is on the left.

As you can see, it took eight steps to match these four letters. Let's look at what each of those eight steps did.

  • Step 1: Start with the first character of the regex, a, and consume the first character of "abcd", also a. Match!
  • Step 2: The regex now reaches the alternation, which has three possible branches: acd, bsc, and bcd. It picks the first branch, acd, and consumes the second character of "abcd", which is b. The match fails, so it has to backtrack: give back the b, abandon the first branch, and try the second branch, bsc.
  • Step 3: In the second branch, bsc, the first character is b; consume the second character of "abcd", also b. Match!
  • Step 4: In the second branch, bsc, the next character is s, but the third character of "abcd" is c. Give back the c and the b that were just consumed, return to the state of step 2, and try the third branch, bcd.
  • Step 5 to step 7: Match the remaining characters of bcd against "abcd".
  • Step 8: The match completes.
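
If you want to poke at the example yourself, here it is in plain JavaScript (a minimal sketch; the step-by-step trace above comes from a regex debugger, which plain JavaScript won't show you):

// The regex and string from the experiment above.
const re = /a(acd|bsc|bcd)/;

// exec returns the overall match plus the captured branch, or null if nothing matches.
console.log(re.exec("abcd")); // [ 'abcd', 'bcd', index: 0, input: 'abcd', ... ]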

Backtracking is fairly common in regular expressions; even in this simple example, the letter b had to be matched three times.

Nested quantifiers

Consider the regex /(a*)*b/ matching "aaaaaaaaaa" (a string of 10 a's). It doesn't look complicated, but try it out:

Oh my god, it took 6,143 steps to complete! What if I add one more a, making it a string of 11 a's?

That's 12,287 steps, double the previous count. As it turns out, with this kind of quantifier nesting, if the worst case happens (the last character fails to match), the regex engine falls into catastrophic backtracking with exponential time complexity.
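
You can feel this doubling directly in JavaScript. A minimal sketch (my own, not from the original; exact timings and tolerable loop bounds vary by engine):

// /(a*)*b/ can never match a string made only of "a"s, so every test below fails,
// but only after the engine has tried the many ways of splitting the "a"s
// between the inner * and the outer *.
const re = /(a*)*b/;
for (let n = 20; n <= 26; n++) {
  const s = "a".repeat(n);
  const start = Date.now();
  re.test(s);
  console.log(`${n} a's: ${Date.now() - start} ms`); // roughly doubles each time
}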

If you try nesting one more layer, a string of just nine a's breaks a million steps…

Other situations

Most of the time things are not as extreme as the example above; more commonly, a quantifier is added to a complex subexpression that already contains a quantifier, or the subexpression contains complex grouping. These situations do occur in the real world, and while they may not be exponential, they still pose significant performance problems.
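
As an illustration (my own example, not from the original), here is the kind of pattern that looks innocent but puts a quantifier around a group whose body already contains quantifiers:

// Each run of word characters can be split between the inner \w+ and the outer *
// in many different ways, and the number of combinations grows exponentially with
// word length, so an input that ultimately fails forces heavy backtracking.
const risky = /^(\w+\s?)*$/;
console.log(risky.test("a few short words"));   // true, and cheap
console.log(risky.test("a few short words !")); // false, but only after trying
                                                // every possible split of each word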

There is one example I find interesting and useful for thinking about performance. Which example? Using a regular expression to test whether a number is prime. Er… it sounds like a bit of a leap, as if the two have nothing to do with each other, but it can be done. Let's keep it simple and not worry about 1. First, convert the number to a string of that many 1's: for example, 5 becomes the string 11111, made up of five ones. The regex used for the test is /^(11+)\1+$/; if it matches, the number is composite, and if it doesn't, the number is prime. It's not that complicated, so I won't go into the details. Along the same lines, there are puzzles that use regular expressions to find integer solutions of linear equations in two variables; the principle is similar.
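
Here is that trick written out as a small JavaScript function (a sketch; the unary conversion and the isPrime name are my own):

// Write n as a string of n ones, then check whether it can be split into two or
// more equal groups of at least two ones each. If it can, n is composite.
function isPrime(n) {
  if (n < 2) return false;                  // set aside 0 and 1, as the article does
  return !/^(11+)\1+$/.test("1".repeat(n));
}

console.log([2, 3, 4, 5, 6, 7, 11, 12].map(isPrime));
// [ true, true, false, true, false, true, true, false ]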

This example isn't actually useful; it just stuck with me because it's fun. What does it mean for us? Simply: don't write such expensive regexes! In this example there doesn't seem to be any quantifier nesting, but as in the other cases, a quantifier is applied to a backreference, and what the backreference will contain is not fixed, so there is still a lot of backtracking. So besides watching out for nested quantifiers, quantifiers on complex subexpressions, and quantifiers on groups, we also need to watch out for quantifiers on backreferences, which I haven't seen anyone else mention.

Some solutions

Quantifier operations

In the case of quantifier nesting, some simple operations can eliminate nesting:

(a*)* <==> (a+)* <==> (a*)+ <==> a*
(a+)+ <==> a+

Very simple, not much to say.
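
As a quick sanity check (my own sketch; timings vary by engine), the flattened pattern gives the same answers but fails without the blow-up:

const nested = /^(a+)+b$/;   // quantifier nesting
const flat   = /^a+b$/;      // equivalent for matching purposes

const s = "a".repeat(26);    // all "a"s, no trailing "b", so both must fail
console.time("flat");   flat.test(s);   console.timeEnd("flat");   // fails almost instantly
console.time("nested"); nested.test(s); console.timeEnd("nested"); // noticeably slower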

Possessive Quantifiers

This one is kind of interesting, but JavaScript doesn't support it yet, so let me keep it very brief. It's easy to use: just add a + after the quantifier, something like /a++b/. So what's the difference between that and /a+b/? A possessive quantifier does not keep any backtracking state; in other words, the former cannot backtrack into the a+. If the match succeeds there is no difference; if the b fails to match, the former gives up immediately, while the latter backtracks and tries again.

Atomic Grouping (also known as solidified grouping)

This one is even more interesting, because it makes a whole subexpression immune to backtracking. The syntax looks like /(?>abc)/. Unfortunately, JavaScript doesn't support this one either. Keep an eye out for it in other languages, though: it is more widely supported than possessive quantifiers (for example, C# supports atomic grouping but not possessive quantifiers), it can mimic possessive quantifiers, and it is more flexible to use.
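
Although JavaScript supports neither feature, there is a well-known workaround (described on regular-expressions.info, not something from the original article): a capturing lookahead plus a backreference behaves like an atomic group, because the engine does not backtrack into a lookahead once it has succeeded. A minimal sketch:

// (?=(a+)) matches the a's without consuming them and captures what it saw;
// \1 then consumes exactly that capture. If the rest of the pattern fails,
// the engine cannot go back and make the lookahead capture fewer a's.
const atomicLike = /^(?=(a+))\1b$/;   // roughly what /^(?>a+)b$/ or /^a++b$/ would mean
console.log(atomicLike.test("aaab")); // true
console.log(atomicLike.test("aaaa")); // false, and it fails quickly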

Reference:

  • Html-js.com/article/284…
  • www.regular-expressions.info/catastrophi…
  • www.regular-expressions.info/possessive….
  • www.regular-expressions.info/atomic.html