Original text: introduction to regular expressions principle | AlloyTeam author: TAT. Liberty

Most of you have probably used regular expressions, but have you ever thought about the principles behind regular expressions while using them, or are you surprised when I tell you that regular expressions might have performance issues that cause them to hang online?

Let’s start with an example of an online accident in early July due to a regular expression.

Blog.cloudflare.com/details-of-…

Simply put, a regular expression with a performance problem that causes catastrophic backtracking, causing the CPU to load up.

Re for performance problems

Let’s look at the re in question

The key part that causes performance problems is.*(? *=.*), we ignore the non-capture group and treat the re for the performance problem as.*.*=.*.

Where. Means to match any character except newline, and.* means to match any character any number of times.

What is backtracking

Backtracking is the act of trying to follow a failed matching path when using greedy or lazy matching or or matching to enter a matching path selection.

Can be understood as a maze, a road to the end, found no way to go back to a fork in the choice of another road.

Back in the phenomenon

// Performance problems re
// Paste the following code into your browser console and try it out
const regexp = `[A-Z]+\\d+(.*):(.*)+[A-Z]+\\d+`;
const str = `A1:B$1,C$1:D$1,E$1:F$1,G$1:H$1`
const reg = new RegExp(regexp);
start = Date.now();
const res = reg.test(str);
end = Date.now();
console.log('Regular re execution time :' + (end - start))
Copy the code

Now what does backtracking look like

Suppose we have a regular (.*)+\d with a string of abcd. Note that only a string of length 4 is entered.

A backtracking of the matching process is shown above, roughly describing the first three rounds of matching.

Note that (.*)+ can be treated as multiple executions of.*. {1} (. *)

First match, because.* can match any character any times, so we can select null, a, AB, ABC, abcd, because * greedy, so.* directly matches abcd, + because there is no other character, so.* does not match abcd. Here the value of + is recorded as 1, and then \d has nothing to match, so the match fails and the first backtrace is performed.

For the second match, because of the backtracking, the last match was abcd, and the route was blocked, so this time we can only try to match ABC, and this time there is a d at the end, so we can understand that.* the first match is ABC, and then because of (.*)+, * can match d, where the value of + is recorded as 2, and then \d has nothing to match, so the match fails and a second backtrace is performed.

Third match, because the back, so back to when a matching route choice on this, the initial one. * matching is ABC, the second one. * matching is d, and road impassability, so for the second time here. * don’t match, this time at the end of a d, a/d and d match, to back a third time.

How can backtracking be reduced or avoided

  • Optimize regular expressions: Always be aware of the performance impact of backtracking.
  • Regular expressions using the DFA regular engine

What is a DFA regular engine

The traditional regular engine is divided into NFA (non-deterministic finite state automata) and DFA (deterministic finite state automata).

DFA

For any given state and input character, the DFA will only move to a certain state. And the DFA does not allow state transitions without input characters.

For example, state 0, when you enter character A, there is only one endpoint, and you can only get to state 1.

NFA

For any state and input character, the state that NFA can transfer is a non-empty set.

For example, state 0, when you enter the character A, you can have multiple endpoints, either state 1 or state 0.

The difference between DFA and NFA’s regular engine

So with all this said, what is the difference between DFA and NFA regular engines? Or how do DFA and NFA implement regular engines?

DFA

The DFA engine inside the re essentially converts the regular expression into a graph adjacency list, and then determines whether a string matches the re in the form of a hop list.

// Let's try it out
function machine(input) {
    if (typeofinput ! = ='string') {
        console.log('Input error');
        return;
    }
    // For example, after the re: / ABC/is converted to DFA
    // Here we define four states: 0,1,2,3. The initial state is 0
    const reg = {
        0: {
            a: 1,},1: {
            b: 3,},2: {
            isEnd: true,},3: {
            c: 2,}};let status = 0;
    for (let i = 0; i < input.length; i++) {
        const inputChar = input[i];
        status = reg[status][inputChar];
        if (typeof status === 'undefined') {
            console.log('Match failed');
            return false; }}const end = reg[status];
    if (end && end.isEnd === true) {
        console.log('Match successful');
        return true;
    } else {
        console.log('Match failed');
        return false; }}const input = 'abc';
machine(input);
Copy the code

Advantages: No matter how bad the regular expression is written, the matching speed is very fast

Cons: Advanced features such as capture groups and assertions are not supported

NFA

The NFA engine in the re is actually a directed graph constructed during parsing. And then through deep search, go path by path recursive attempt.

Advantages: Powerful, can get the matched context information, support various assertions capture group loop and other functions

Disadvantages: high requirements for the development of regular foundation, need to pay attention to the performance problems caused by backtracking

conclusion

Now back to the beginning of the question, why does his re have performance problems

  1. First of all, his re uses NFA’s re engine (most of the re engines of languages are NFA, js is also, so pay attention to the impact of performance problems).
  2. He wrote regular expressions that had performance problems and were prone to catastrophic backtracking.

To avoid such problems, either raise your development awareness of the performance issues of regex or switch to DFA’s regex engine (which is fast, weak, and does not capture group assertions, etc.).

Matters needing attention

In the ordinary writing of re, write less fuzzy matching, the more accurate the better, fuzzy matching, greedy matching, lazy matching will bring backtracking problems, choose a way as little influence as possible. Keep a performance problem in mind when writing regex.

If you are interested in dFA’s regular engine, you can check it out.Github.com/libertyzhao…


AlloyTeam welcomes excellent friends to join. Resume submission: [email protected] For details, please click Tencent AlloyTeam to recruit Web front-end engineers.