Regular expression usages that are not easy to figure out on your own

Original: Coding Diary (WeChat official account: Codelogs). Sharing is welcome; please credit the source when reprinting.

Introduction

Regular expressions are pretty much a must-have skill for programmers these days. They’re easy to get started with, but if you don’t learn them carefully, you’ll end up stuck in the basics of regular expressions. So, in this article, I’ll cover some of the regex scenarios that might take some time to implement or might not work out at all, as well as some of the regular expression performance issues.

Match multiple words

Suppose I want to match any of the three names zhangsan, lisi, and wangwu. This is a very common scenario and a basic regex skill, but since I had to search online for the answer when I was starting out, it is still worth mentioning. The implementation is as follows:

zhangsan|lisi|wangwu

| means "or", so this matches zhangsan, lisi, or wangwu.
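A minimal Java sketch of the alternation above. The class name NameMatcher, the helper findNames, and the sample sentence are made up for illustration and are not from the original article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameMatcher {
    // Alternation: matches any one of the three names.
    private static final Pattern NAMES = Pattern.compile("zhangsan|lisi|wangwu");

    // Collects every name found in the input, in order of appearance.
    static List<String> findNames(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NAMES.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNames("lisi met wangwu, not zhaoliu"));
    }
}
```

If the names must stand alone rather than appear inside longer words, the alternation can be wrapped in a group with boundaries, for example \b(?:zhangsan|lisi|wangwu)\b.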

Matching repeated digits

To match four-digit numbers made of one repeated digit, such as 1111, 2222, and 3333, \d{4} is not enough: it matches 1111, but it also matches 1234. The regex can be written as follows:

(\d)\1{3}

(\d) captures the first digit, and \1 is a backreference to whatever that group captured. Repeating the backreference three times matches four-digit strings such as 1111 or 2222.
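The backreference can be tried out with a small Java sketch. The class and method names here are illustrative only.

```java
import java.util.regex.Pattern;

class RepeatedDigits {
    // (\d) captures one digit; \1{3} requires that same digit three more times.
    private static final Pattern FOUR_SAME = Pattern.compile("(\\d)\\1{3}");

    // True only when the whole string is one digit repeated four times.
    static boolean isFourSameDigits(String s) {
        return FOUR_SAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFourSameDigits("1111"));
        System.out.println(isFourSameDigits("1234"));
    }
}
```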

Match various whitespace

In Java, \s and the POSIX character class \p{Space} match only ASCII whitespace by default. To match Unicode whitespace as well (for example the full-width space U+3000), enable the UNICODE_CHARACTER_CLASS flag (inline form (?U)) or use the \p{IsWhite_Space} property. Other languages offer similar features, though the regex syntax may differ slightly.
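A small Java sketch of whitespace matching. It relies on my understanding of java.util.regex.Pattern: the POSIX class \p{Space} is ASCII-only unless Pattern.UNICODE_CHARACTER_CLASS is set, while \p{IsWhite_Space} always uses the Unicode property. The class and field names are illustrative.

```java
import java.util.regex.Pattern;

class WhitespaceDemo {
    // POSIX class without flags: ASCII whitespace only (assumption based on the Pattern docs).
    static final Pattern ASCII_SPACE = Pattern.compile("\\p{Space}");

    // Same class with the Unicode flag: follows the Unicode White_Space property.
    static final Pattern UNICODE_SPACE =
            Pattern.compile("\\p{Space}", Pattern.UNICODE_CHARACTER_CLASS);

    // Explicit Unicode binary property, no flag needed.
    static final Pattern WHITE_SPACE_PROPERTY = Pattern.compile("\\p{IsWhite_Space}");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String ideographicSpace = "\u3000"; // full-width space used in CJK text
        System.out.println(matches(ASCII_SPACE, ideographicSpace));
        System.out.println(matches(UNICODE_SPACE, ideographicSpace));
        System.out.println(matches(WHITE_SPACE_PROPERTY, ideographicSpace));
    }
}
```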

Position matching

In regular expressions, \G and lookaround are rather hard to understand, because many books only list their matching rules without explaining the underlying idea. Rules learned by rote are forgotten after a while, and it stays unclear what these two features are for.

Let us change our way of thinking. A regular expression really matches only two kinds of things: characters in the string, and positions in the string.

Take the string hello. It has five characters to match, plus six positions: before h, between each pair of adjacent letters, and after o. In ^hello, the ^ matches the starting position. So _hello cannot be matched by ^hello, because the position between _ and h is not the start of the text and cannot be matched by ^.

Common position matching rules

Rule	Position matched
^ \A	Start position
$ \z \Z	End position
\b \B	Word boundary and non-word-boundary positions
\G	Start position of the current match
(?=a) (?!a)	Lookahead: checks whether the current position is or is not followed by a
(?<=a) (?<!a)	Lookbehind: checks whether the current position is or is not preceded by a

^ and \A

^ matches the start of the text, but in multi-line mode ^ matches the start of each line.

\A matches only the start of the text, regardless of the matching mode.

$ and \Z

$ matches the end of the text, but in multi-line mode $ matches the end of each line.

\Z matches only the end of the text, regardless of the matching mode.

\b and \B

\b matches word boundaries. In Java, a word boundary is the position between a letter and a non-letter (Chinese characters are not counted as words here), and the start and end of the text also count as word boundaries.

\B matches positions that are not word boundaries.

\G

\G matches the position where the previous match ended, which is also where the current match attempt starts. For the first match, that is the start of the text. For example:

If you use \d to find single digits in 1234a5678, you get 8 matches, but if you use \G\d you get only 4.

Search process:

On the first search, \G matches the start of the text, 1 matches \d, and the first match, 1, is found

On the second search, \G matches the position between 1 and 2, 2 matches \d, and the second match, 2, is found

On the third search, \G matches the position between 2 and 3, and 3 matches \d, and the third match, 3, is found

On the fourth search, \G matches the position between 3 and 4, and 4 matches \d, finding the fourth match, that is, 4

On the fifth search, \G matches the position between 4 and a, but a does not match \d, so the search ends.
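The search process above can be reproduced with a short Java sketch that counts successive find() calls. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContinuationAnchor {
    // Counts how many times the regex can be found in the input, scanning left to right.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Plain \d may skip over the letter a and keep finding digits.
        System.out.println(countMatches("\\d", "1234a5678"));
        // \G\d must start exactly where the previous match ended, so it stops at a.
        System.out.println(countMatches("\\G\\d", "1234a5678"));
    }
}
```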

Lookaround

(?=a) and (?!a)

Positive (negative) lookahead checks whether the character after the current position is (is not) a.

(?<=a) and (?<!a)

Positive (negative) lookbehind checks whether the character before the current position is (is not) a.

For example, to find a word wrapped in parentheses, use lookaround to require that the word has ( on its left and ) on its right.
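The original regex for this example is not shown in the text, so the pattern below is my own reconstruction of the idea: a lookbehind for ( and a lookahead for ) around \w+. The class name and sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenthesizedWords {
    // (?<=\() : the position must be preceded by "("
    // \w+     : the word itself (the only part that is consumed)
    // (?=\))  : the position after the word must be followed by ")"
    private static final Pattern WORD_IN_PARENS = Pattern.compile("(?<=\\()\\w+(?=\\))");

    static List<String> find(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD_IN_PARENS.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("call (foo) and bar and (baz)"));
    }
}
```

Because lookaround only tests positions, the parentheses themselves are not part of the match result.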

A position in the text can be matched multiple times, and several position rules can apply to the same position at once, regardless of their order in the regular expression. For example, the following three regular expressions are equivalent:

^abc
^^^^^^abc
^(?=a)\b^^^abc
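A quick Java check that stacking position rules on the same position does not change the result. The helper name and the test inputs are illustrative.

```java
import java.util.regex.Pattern;

class PositionEquivalence {
    static final String[] REGEXES = {
        "^abc",
        "^^^^^^abc",
        "^(?=a)\\b^^^abc"
    };

    // True when the regex can be found somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        for (String regex : REGEXES) {
            System.out.println(regex + " -> " + found(regex, "abc def") + ", " + found(regex, "xabc"));
        }
    }
}
```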

Here are two practical examples of position matching!

Example 1: Verify password strength. A password must be 8 to 10 characters long and must contain digits, letters, and punctuation marks. It can be verified with a regular expression as follows:

^(?=.*[0-9])(?=.*[a-zA-Z])(?=.*\p{P}).{8,10}$

Here, (?=.*[0-9]) requires a digit somewhere after the starting position, (?=.*[a-zA-Z]) requires a letter somewhere after the starting position, (?=.*\p{P}) requires a punctuation mark somewhere after the starting position, and .{8,10} matches 8 to 10 characters. Combined, these rules verify password strength.
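A minimal Java sketch of the password check. The class name and sample passwords are made up. One thing to keep in mind is that \p{P} is the Unicode punctuation category, so characters such as $ or +, which Unicode classifies as symbols, would not count.

```java
import java.util.regex.Pattern;

class PasswordStrength {
    // Three lookaheads all test the same starting position; .{8,10} then consumes the text.
    private static final Pattern STRONG =
            Pattern.compile("^(?=.*[0-9])(?=.*[a-zA-Z])(?=.*\\p{P}).{8,10}$");

    static boolean isStrong(String password) {
        return STRONG.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStrong("abc123!x"));  // digit, letter, punctuation, 8 chars
        System.out.println(isStrong("abcdefg1"));  // no punctuation
    }
}
```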

Example 2: Sometimes we need to change 123456789 to 123,456,789. This can be done with the following regex:

(?!^)(?=(\d{3})+$)

Here, (?=(\d{3})+$) matches a position that must be followed by one or more groups of three digits up to the end. Three positions satisfy this condition: the position between the start and 1, the position between 3 and 4, and the position between 6 and 7. (?!^) then restricts the same position so that it cannot be the start, which leaves only the positions between 3 and 4 and between 6 and 7. Replacing those positions with a comma turns 123456789 into 123,456,789.
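In Java the replacement can be sketched with String.replaceAll, which accepts zero-width matches and inserts the replacement at each matched position. The class and method names are illustrative.

```java
class ThousandsSeparator {
    // Inserts a comma at every position that is not the start and is followed
    // by a multiple of three digits up to the end of the string.
    static String addCommas(String digits) {
        return digits.replaceAll("(?!^)(?=(\\d{3})+$)", ",");
    }

    public static void main(String[] args) {
        System.out.println(addCommas("123456789"));
    }
}
```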

Matching quoted strings

To match quoted strings such as "hello,world", it is easy to think of "[^"]+". But what if \ is allowed to escape " inside the quoted string, as in "hello \"bob\"!"? If you use "[^"]+", you only get "hello \". Cannot think of a solution? A string containing backslash escapes can be broken down into the pieces ", hello , \"bob, \"!, and ". Generalizing each piece into a regex gives ", [^\\"]*, \\.[^\\"]*, \\.[^\\"]*, and ", which combine as follows:

"[^\\"]*(?:\\.[^\\"]*)*"

The extra (?:) here denotes a non-capturing group, which can improve matching performance. Because an escape may occur any number of times, (?:\\.[^\\"]*) is followed by *. When there are no escapes at all, the leading [^\\"]* matches everything inside the quotation marks directly.
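Written as a Java string literal, every backslash in the regex has to be doubled and every quote escaped, which makes the pattern hard to read. The sketch below spells it out; the class name and sample text are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedString {
    // Regex:  "[^\\"]*(?:\\.[^\\"]*)*"
    // Pieces: opening quote, a run of ordinary characters,
    //         then any number of (escape pair + run of ordinary characters), closing quote.
    static final Pattern QUOTED =
            Pattern.compile("\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"");

    // Returns the first quoted string in the text, or null when there is none.
    static String firstQuoted(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The text is: say "hello \"bob\"!" now
        System.out.println(firstQuoted("say \"hello \\\"bob\\\"!\" now"));
    }
}
```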

Don’t blow up the CPU

If you write a complex regular expression, evaluate it carefully: it may work fine most of the time and then drive the CPU to 100% on certain inputs. For example, to match a quoted string you might come up with the following regex:

"([^\\"]+|\\.)*"

At first glance this regex looks perfect: [^\\"]+ matches the unescaped part, and \\. matches escapes such as \" and \n. It has no problem with strings that satisfy the pattern, such as "hello \"bob\"!". But on a string that does not satisfy the pattern, the matching cost grows exponentially with the length of the string and the CPU goes to 100%. An example is "hello \"bob\"!!!!!!!!!!!!!!!!!!, where the closing " is missing.

public static void main(String[] args) {
    long begin = System.currentTimeMillis();
    // The input has an opening quote but no closing quote, so the match has to fail.
    boolean isMatch = "\"hello \\\"bob\\\"!!!!!!!!!!!!!!!!!!".matches("\"([^\\\\\"]+|\\\\.)*\"");
    System.out.println(String.format("%s ms, isMatch: %s", System.currentTimeMillis() - begin, isMatch));
}

On my machine this Java code takes about 2 s, but if I add four more ! to the string, the running time immediately rises to 17 s. The performance drop is dramatic!

This is caused by backtracking. For example, when ".*" is matched against "hello", the first " in the regex matches the opening quote, and then .* consumes h, e, l, l, o, and " in turn. The final " in the regex then has nothing left to match at the end of the string, so the regex engine makes .* give back the last character it consumed. The " that is given back matches the final " in the regex, and the match succeeds.

If the string is "hello with no closing quote, .* gives back characters one at a time until it has nothing left to give back, and only then is the match declared a failure.

When a regex contains the greedy quantifiers ?, *, and +, you can picture them eating as many characters as they can and being forced to spit some back out when the rule that follows fails to match. The lazy quantifiers ??, *?, and +? do the opposite: picture them eating nothing at first and being forced to eat more when the rule that follows fails to match.
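The difference between greedy and lazy quantifiers is easy to see on a text with two quoted parts. This is a small Java sketch with an illustrative class name and input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns the first match of the regex in the input, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "\"a\" and \"b\""; // the text is: "a" and "b"
        // Greedy: .* eats to the end, then gives back only until the last quote fits.
        System.out.println(firstMatch("\".*\"", text));
        // Lazy: .*? eats nothing at first and grows only until the next quote fits.
        System.out.println(firstMatch("\".*?\"", text));
    }
}
```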

Let us analyze why "([^\\"]+|\\.)*" is so inefficient when matching a string such as "!!! (an opening quote, a run of !, and no closing quote). Note: I have simplified the string to be matched for the sake of analysis, but the effect is the same.

First, [^\\"]+ eats all of the !.
Then the closing " in the regex fails to match at the end of the string.
So [^\\"]+ spits out one !. Note that because there is another greedy quantifier * on the outside, the ! that was spit out is eaten by [^\\"]+ in the next iteration of ([^\\"]+|\\.). After eating it, the engine reaches the end of the string again, the closing " in the regex again fails to match, and that second [^\\"]+ is asked to spit out what it just ate, after which the match still fails.
The first [^\\"]+ is then forced to spit out the second-to-last ! as well. Now two ! follow the current position, and they are again eaten by [^\\"]+ in later iterations of ([^\\"]+|\\.). The whole process repeats, spitting out and eating again, so the number of attempts grows exponentially.

The cause of this problem is that the regex has two nested quantifiers, one inside and one outside. If you do not believe me, try matching ^(a+)*$ against aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa0; it is also very slow. There are two ways to solve this problem.

Make sure that characters spit out by [^\\"]+ cannot be eaten again by another greedy quantifier in the outer loop, as in the regex described earlier, "[^\\"]*(?:\\.[^\\"]*)*". Characters spit out by the leading [^\\"]* cannot be eaten by \\.[^\\"]*, because what is spit out is never a \, and \\.[^\\"]* has to eat a \ first.
If you know that the characters spit out cannot match the rule that follows, let the quantifier keep what it eats and never spit it out, for example by changing the regex to "([^\\"]++|\\.)*". Here + has become ++. Quantifiers of this form, a quantifier followed by +, such as ?+, *+, and ++, are called possessive quantifiers: once they have eaten characters, they never give them back.
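Both fixes can be sketched in Java and tried on an unclosed string that is far longer than the one that took seconds above. The class name, the helper unclosed, and the length 1000 are my own choices for illustration; no timing figures are claimed here.

```java
import java.util.regex.Pattern;

class SafeQuotedString {
    // Fix 1: unrolled form, "[^\\"]*(?:\\.[^\\"]*)*"
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"");

    // Fix 2: possessive inner quantifier, "([^\\"]++|\\.)*"
    static final Pattern POSSESSIVE =
            Pattern.compile("\"([^\\\\\"]++|\\\\.)*\"");

    // Builds the text "hello \"bob\" followed by n exclamation marks and no closing quote.
    static String unclosed(int n) {
        StringBuilder sb = new StringBuilder("\"hello \\\"bob\\\"");
        for (int i = 0; i < n; i++) {
            sb.append('!');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bad = unclosed(1000);
        long begin = System.currentTimeMillis();
        boolean a = UNROLLED.matcher(bad).matches();
        boolean b = POSSESSIVE.matcher(bad).matches();
        System.out.println(a + " " + b + " in " + (System.currentTimeMillis() - begin) + " ms");
    }
}
```

In both patterns a failed attempt leaves no alternative way to split the run of ! between the inner and outer quantifier, so the engine gives up after a small number of steps instead of trying every split.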

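Possessive quantifiers must be used with care, though. ^.+b$ matches ab, but ^.++b$ does not, because .++ eats all of ab and never spits the b back out for the trailing b to match. Writing the regex as ^[^b]++b$ avoids the problem, because [^b]++ stops before the b.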
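The three patterns can be compared directly in Java. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class PossessiveCaveat {
    // True when the regex matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("^.+b$", "ab"));      // greedy: .+ gives the b back
        System.out.println(matches("^.++b$", "ab"));     // possessive: .++ keeps the b
        System.out.println(matches("^[^b]++b$", "ab"));  // possessive, but never eats the b
    }
}
```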

Conclusion

Regular expressions are powerful and can get a lot done with little effort, but you also need to understand how they are executed to avoid the exponential backtracking trap.
