This follows up on the previous article, “Regular expressions are really slick, but you can’t write them!” After it was published, many readers asked why it didn’t cover assertions, backreferences, or greedy matching, and one old friend joked that he had gotten all set for the main event and I only gave him a teaser, ha ha. So, while stuck at home on the forced holiday from Typhoon Mangkhut, I’ll cover the rest of the regex material. I hope you like it, and I hope that old friend gets the full show this time.
The goal of this article is to explain the driest fundamentals in the plainest language.
Article outline:
- Zero-width assertions
- Capturing and non-capturing groups
- Backreferences
- Greedy and non-greedy matching
- Negation (antisense)
1. Zero-width assertion
Neither “zero-width” nor “assertion” sounds familiar, so let’s explain the two terms.
- Assertion: In everyday speech, an assertion is “I state that something is so.” In a regex, an assertion states that something matching a given rule comes before or after a given position. For example, in “ss1aa2bb3”, a regex can assert that aa2 is followed by bb3, or that aa2 is preceded by ss1.
- Zero-width: it has no width. An assertion matches only a position, not characters, so the asserted text itself is not returned in the match result.
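To make “matches a position, not characters” concrete, here is a minimal sketch that compares the span of a plain match with the span of a lookahead. The class name ZeroWidthDemo and the helper firstSpan are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ZeroWidthDemo {
    // Hypothetical helper: returns "start-end" of the first match, or "none".
    static String firstSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() : "none";
    }

    public static void main(String[] args) {
        // A plain pattern consumes the character it matches.
        System.out.println(firstSpan("b", "abc"));     // 1-2
        // A lookahead only checks that "b" comes next and consumes nothing.
        System.out.println(firstSpan("(?=b)", "abc")); // 1-1
    }
}
```

The lookahead match starts and ends at the same index, which is exactly what “zero width” means.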
With the meaning clear, what is it good for? Suppose we use a crawler to collect the read count of CSDN articles. In the page source, the read count sits in a structure like this:
    <span class="read-count">Readings: 641</span>
Only ‘641’ varies; different articles have different values. Once we have this string there are many ways to pull out the ‘641’, but how do we match it with a regex?
Here are a few types of assertions:
- Positive lookahead assertion:
- Syntax: (?=pattern)
- Action: Matches the content that comes before pattern; pattern itself is not part of the returned match.
Here we want the read count, which in regex terms means matching the content in front of </span>. A positive lookahead matches what precedes its pattern, so (?=</span>) lets us match the preceding content. What should we match? If you want everything, it’s:
    // Requires java.util.regex.Pattern and java.util.regex.Matcher.
    String reg = ".+(?=</span>)";

    String test = "<span class=\"read-count\">Readings: 641</span>";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    while (mc.find()) {
        System.out.println("Matching result:");
        System.out.println(mc.group());
    }

    // Match result:
    // <span class="read-count">Readings: 641
But we only want the digits, and digits are matched by \d, so change it to:
    String reg = "\\d+(?=</span>)";
    String test = "<span class=\"read-count\">Readings: 641</span>";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    while (mc.find()) {
        System.out.println(mc.group());
    }
    // Match result:
    // 641
And we’re done!
- Positive lookbehind assertion:
- Syntax: (?<=pattern)
- Action: Matches the content that comes after pattern; pattern itself is not part of the returned match.
Where there is a lookahead there is a lookbehind: a lookahead matches the content in front of the pattern, and a lookbehind matches the content after it. The example above can also be handled with a lookbehind.
    // (?<=Readings: )\d+
    String reg = "(?<=Readings: )\\d+";

    String test = "<span class=\"read-count\">Readings: 641</span>";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    while (mc.find()) {
        System.out.println(mc.group());
    }
    // Match result:
    // 641
It’s that simple.
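The two assertions can also be combined so that the digits are pinned from both sides. This is only a sketch that assumes the page structure shown earlier; the class ReadCountExtractor and the method readCount are names invented here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReadCountExtractor {
    // Digits that follow "Readings: " and are directly followed by "</span>".
    private static final Pattern READ_COUNT =
            Pattern.compile("(?<=Readings: )\\d+(?=</span>)");

    // Returns the read count, or -1 when the snippet does not contain one.
    static int readCount(String html) {
        Matcher m = READ_COUNT.matcher(html);
        return m.find() ? Integer.parseInt(m.group()) : -1;
    }

    public static void main(String[] args) {
        String html = "<span class=\"read-count\">Readings: 641</span>";
        System.out.println(readCount(html)); // 641
    }
}
```

Neither the label nor the closing tag ends up in the match, so the result can be parsed as a number directly.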
- Negative lookahead assertion:
- Syntax: (?!pattern)
- Action: Matches a position that is not followed by pattern; pattern itself is not part of the returned match.
Where there is a positive there is a negative, and negative here simply means “not”. For example, in the sentence "I love the motherland, I am the motherland's flower", to find the "motherland" that is not followed by "'s flower", you can write:
    motherland(?!'s flower)
- Negative lookbehind assertion:
- Syntax: (?<!pattern)
- Action: Matches a position that is not preceded by pattern; pattern itself is not part of the returned match.
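The two negative assertions can be tried out with a small sketch like the one below. The sample sentences, the class NegativeAssertionDemo, and the helper starts are all assumptions made for illustration; the helper just lists where each match begins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NegativeAssertionDemo {
    // Hypothetical helper: collects the start index of every match.
    static List<Integer> starts(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "I love the motherland, I am the motherland's flower";
        // Negative lookahead: only the "motherland" NOT followed by "'s flower".
        System.out.println(starts("motherland(?!'s flower)", sentence)); // [11]
        // Negative lookbehind: only the "happy" NOT preceded by "un".
        System.out.println(starts("(?<!un)happy", "happy, unhappy"));    // [0]
    }
}
```

In both cases only the first occurrence survives, because the second one is ruled out by the text next to it.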
2. Capture and non-capture
Capturing means saving what a subexpression matched. Capture goes hand in hand with grouping, hence the term “capture group.”
Capture group: matches the content of a subexpression and stores the match in memory under a number or an explicit name, so that it can be used later by that number or name. Groups are numbered in the order of their opening parentheses.
Groups come in the following kinds:
- Numbered capture group: Syntax: (exp). Explanation: counting from the left of the expression, the content between each opening parenthesis and its matching closing parenthesis is one group. Group 0 is the entire expression, and the first real group is group 1. For example, for the landline number 020-85653333, the expression is: (0\d{2})-(\d{8})
Index | Number | Group | Content |
---|---|---|---|
0 | 0 | (0\d{2})-(\d{8}) | 020-85653333 |
1 | 1 | (0\d{2}) | 020 |
2 | 2 | (\d{8}) | 85653333 |
Let’s use Java to verify:
    String test = "020-85653333";
    String reg = "(0\\d{2})-(\\d{8})";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    if (mc.find()) {
        System.out.println("The number of groups is: " + mc.groupCount());
        // Group 0 is the whole match, so loop from 0 through groupCount().
        for (int i = 0; i <= mc.groupCount(); i++) {
            System.out.println("Group " + i + " is: " + mc.group(i));
        }
    }
Output results:
    The number of groups is: 2
    Group 0 is: 020-85653333
    Group 1 is: 020
    Group 2 is: 85653333
As you can see, groupCount() reports 2 because it does not count group 0, but group 0 (the whole match) is printed along with the others.
- Named capture group: Syntax: (?<name>exp). Explanation: the group is named by the name in the expression. For example, the area-code expression can also be written as: (?<quhao>0\d{2})-(?<haoma>\d{8}). By the order of the opening parentheses, this expression is grouped as follows:
Index | Name | Group | Content |
---|---|---|---|
0 | 0 | (?<quhao>0\d{2})-(?<haoma>\d{8}) | 020-85653333 |
1 | quhao | (?<quhao>0\d{2}) | 020 |
2 | haoma | (?<haoma>\d{8}) | 85653333 |
Verify this with code:
    String test = "020-85653333";
    String reg = "(?<quhao>0\\d{2})-(?<haoma>\\d{8})";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    if (mc.find()) {
        System.out.println("The number of groups is: " + mc.groupCount());
        // Named groups are read back by name instead of by number.
        System.out.println("Group name: quhao, matched content: " + mc.group("quhao"));
        System.out.println("Group name: haoma, matched content: " + mc.group("haoma"));
    }
Output results:
    The number of groups is: 2
    Group name: quhao, matched content: 020
    Group name: haoma, matched content: 85653333
- Non-capturing group: Syntax: (?:exp). Explanation: the opposite of a capture group; it marks a group that does not need to be captured. In plain terms, you only keep the groups you actually need.
For example, in the regular expression above, the program does not need to use the first group, so it can be written like this:
    (?:0\d{2})-(\d{8})
Index | Number | Group | Content |
---|---|---|---|
0 | 0 | (?:0\d{2})-(\d{8}) | 020-85653333 |
1 | 1 | (\d{8}) | 85653333 |
Verify:
    String test = "020-85653333";
    String reg = "(?:0\\d{2})-(\\d{8})";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    if (mc.find()) {
        System.out.println("The number of groups is: " + mc.groupCount());
        // The non-capturing group takes no number, so (\d{8}) is now group 1.
        for (int i = 0; i <= mc.groupCount(); i++) {
            System.out.println("Group " + i + " is: " + mc.group(i));
        }
    }
Output results:
    The number of groups is: 1
    Group 0 is: 020-85653333
    Group 1 is: 85653333
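Non-capturing groups are also handy when parentheses are only needed to hold alternatives together. The sketch below is an illustration under assumed names (NonCapturingDemo, afterScheme) and an assumed sample URL; it is not from the original example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingDemo {
    // (?:https?|ftp) groups the alternatives without taking a group number,
    // so the part after "://" is group 1.
    private static final Pattern URL = Pattern.compile("(?:https?|ftp)://(\\S+)");

    // Returns everything after the scheme, or null if the input is not a match.
    static String afterScheme(String url) {
        Matcher m = URL.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    // Number of capture groups in the pattern (group 0 is not counted).
    static int captureGroups() {
        return URL.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(afterScheme("https://example.com/a")); // example.com/a
        System.out.println(captureGroups());                      // 1
    }
}
```

Had the scheme been written as (https?|ftp), the interesting part would have shifted to group 2.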
3. Backreference
As we saw, a capture produces a capture group that is stored in memory. It can be referenced outside the regular expression by the program, and also inside the regular expression itself. The latter kind of reference is called a backreference.
Following the naming rules of capture groups, backreferences come in two forms:
- Numbered group backreference: \number (some engines, such as .NET, also accept \k<number>; Java does not)
- Named group backreference: \k<name> (some engines also accept \k'name'; Java does not)
Is that all? Still don’t get it? Maybe the point of capturing wasn’t clear earlier either, and that is perfectly normal, because capture groups are usually used together with backreferences.
As mentioned above, a capture group stores the content matched by a subexpression, by number or by name, for later use. Note two words: “content” and “use”. The “content” is the matched text, not the subexpression itself. And what does “use” mean?
Mainly, it is used to find repeated content or to replace specified text.
For example, finding the doubled letters in the string “aabbbbgbddesddfiid” cannot be done with the regex features covered so far. Let’s think it through step by step:
- 1) A letter is matched
- 2) Match the next letter and check whether it is the same as the last letter
- 3) If they are the same, the match succeeds; otherwise, the match fails
In step 2, matching the next letter requires the previous letter, so how do we remember it? First match a letter with \w; to capture it we need a group, so we write (\w).
This expression now has one capture group, (\w). Next we use the captured content as the condition, which gives (\w)\1, and we’re done. Recall that a capture group is named either by its position or by a custom name. By default every capture group gets a number, starting from 1, so the first capture group is referenced as \1 (engines that support \k<number> would also accept \k<1>, but \1 is the usual form). Let’s test it:
    String test = "aabbbbgbddesddfiid";
    // (\w) captures one character; \1 requires the same character again.
    Pattern pattern = Pattern.compile("(\\w)\\1");
    Matcher mc = pattern.matcher(test);
    while (mc.find()) {
        System.out.println(mc.group());
    }
Output results:
    aa
    bb
    bb
    dd
    dd
    ii
That’s exactly what we wanted. Now an example of replacement: suppose you want to replace every abc in a string with a:
    String test = "abcbbabcbcgbddesddfiid";
    String reg = "(a)(b)c";
    // In the replacement string, $1 refers to the first capture group, (a).
    System.out.println(test.replaceAll(reg, "$1"));
Output results:
    abbabcgbddesddfiid
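The same two tasks can be written with named groups. This sketch assumes Java, where a named group is referenced as \k<name> inside the pattern and as ${name} in the replacement string; the group names ch and first and the class NamedBackreferenceDemo are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedBackreferenceDemo {
    // Finds doubled characters: (?<ch>\w) captures one, \k<ch> demands it again.
    static List<String> doubledLetters(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(?<ch>\\w)\\k<ch>").matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Replaces every "abc" with the content of the named group "first" (the "a").
    static String abcToA(String input) {
        return input.replaceAll("(?<first>a)bc", "${first}");
    }

    public static void main(String[] args) {
        System.out.println(doubledLetters("aabbbbgbddesddfiid")); // [aa, bb, bb, dd, dd, ii]
        System.out.println(abcToA("abcbbabcbcgbddesddfiid"));     // abbabcgbddesddfiid
    }
}
```

The results are the same as with \1 and $1; names mainly help once a pattern has more than a couple of groups.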
4. Greed and non-greed
1. Greed
As we all know, greed means never being satisfied and wanting as much as possible. In a regex, greedy means much the same:
Greedy matching: when a regular expression contains a quantifier that allows repetition, the default behavior is to match as many characters as possible (while still letting the whole expression match). This is called greedy matching. Features: the quantifier first takes as much as it can; if the rest of the expression then fails to match, it gives back one character at a time from the right and tries again (this is called backtracking), until the whole expression matches or there is nothing left to give back. It therefore returns as much data as possible, more rather than less.
The repetition quantifiers covered earlier are all greedy by default. Take this expression:
    \d{3,6}
This matches three to six digits. Because it is greedy, if the string has six digits available, it takes all six. For example:
    String reg = "\\d{3,6}";
    String test = "61762828 176 2991 44 871";
    System.out.println("Text: " + test);
    System.out.println("Greedy mode: " + reg);
    Pattern p1 = Pattern.compile(reg);
    Matcher m1 = p1.matcher(test);
    while (m1.find()) {
        System.out.println("Matching result: " + m1.group(0));
    }
Output results:
    Text: 61762828 176 2991 44 871
    Greedy mode: \d{3,6}
    Matching result: 617628
    Matching result: 176
    Matching result: 2991
    Matching result: 871
As the result shows, for “61762828” three digits (617) would have been enough for a match, but the quantifier is not satisfied with that and takes the maximum it can, namely six. If a single quantifier is this greedy, what happens when several greedy quantifiers sit side by side? How do they divide up the match?
When several greedy quantifiers appear together and the string is long enough to give each its maximum, they do not affect one another. When it is not, they are served from left to right: each quantifier first takes as much as it can, the rest goes to the next quantifier, and an earlier quantifier gives characters back only if a later one would otherwise fail.
    String reg = "(\\d{1,2})(\\d{3,4})";
    String test = "61762828 176 2991 87321";
    System.out.println("Text: " + test);
    System.out.println("Greedy mode: " + reg);
    Pattern p1 = Pattern.compile(reg);
    Matcher m1 = p1.matcher(test);
    while (m1.find()) {
        System.out.println("Matching result: " + m1.group(0));
    }
Output results:
    Text: 61762828 176 2991 87321
    Greedy mode: (\d{1,2})(\d{3,4})
    Matching result: 617628
    Matching result: 2991
    Matching result: 87321
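- “617628”: the leading \d{1,2} matches 61 and the following \d{3,4} matches 7628
- “2991”: the leading \d{1,2} matches 2 and the following \d{3,4} matches 991 (it first tries 29, but the remaining 91 is too short for \d{3,4}, so it gives one digit back)
- “87321”: the leading \d{1,2} matches 87 and the following \d{3,4} matches 321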
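The split between the two quantifiers can be checked by printing each group separately. A minimal sketch, with GreedySplitDemo and splits as made-up names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedySplitDemo {
    // For every match, records "group1+group2" to show who took which digits.
    // The regex passed in is assumed to have at least two capture groups.
    static List<String> splits(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1) + "+" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "61762828 176 2991 87321";
        System.out.println(splits("(\\d{1,2})(\\d{3,4})", text)); // [61+7628, 2+991, 87+321]
    }
}
```

“176” produces no entry at all: whichever way its three digits are divided, \d{3,4} never gets the three digits it needs.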
2. Laziness (not greed)
Lazy matching: when a quantifier is marked lazy, it matches as few characters as possible (while still letting the whole expression match). This is called lazy matching. Features: matching proceeds from left to right. The quantifier first takes the minimum it is allowed; if the rest of the expression then matches, the match is done. Otherwise it takes one more character and tries again, and so on, until the match succeeds or the string is exhausted.
A lazy quantifier is a greedy quantifier followed by a “?”:
Code | Description |
---|---|
*? | Repeat any number of times, but as few as possible |
+? | Repeat 1 or more times, but as few as possible |
?? | Repeat 0 or 1 times, but as few as possible |
{n,m}? | Repeat n to m times, but as few as possible |
{n,}? | Repeat n or more times, but as few as possible |
    String reg = "(\\d{1,2}?)(\\d{3,4})";
    String test = "61762828 176 2991 87321";
    System.out.println("Text: " + test);
    System.out.println("Lazy mode: " + reg);
    Pattern p1 = Pattern.compile(reg);
    Matcher m1 = p1.matcher(test);
    while (m1.find()) {
        System.out.println("Matching result: " + m1.group(0));
    }
Output results:
    Text: 61762828 176 2991 87321
    Lazy mode: (\d{1,2}?)(\d{3,4})
    Matching result: 61762
    Matching result: 2991
    Matching result: 87321
Explanation:
“61762”: the lazy quantifier on the left matches 6 and the greedy one on the right matches 1762. “2991”: the lazy one matches 2 and the greedy one matches 991. “87321”: the lazy one matches 8 and the greedy one matches 7321.
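A common place where the difference matters is matching delimited pieces such as tags. The following sketch compares a greedy and a lazy version of the same pattern on a made-up HTML fragment; LazyVsGreedyDemo and matches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyVsGreedyDemo {
    // Collects every match of regex in input.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        // Greedy: .+ runs to the last '>' it can reach.
        System.out.println(matches("<.+>", html));  // [<b>bold</b> and <i>italic</i>]
        // Lazy: .+? stops at the first '>' that completes a match.
        System.out.println(matches("<.+?>", html)); // [<b>, </b>, <i>, </i>]
    }
}
```

The only change between the two patterns is the trailing “?”, yet one returns the whole line and the other returns the four tags.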
5. Negation (antisense)
The metacharacters so far all match something. If you want the opposite, that is, to exclude certain characters, regex also provides some common negated metacharacters:
Metacharacter | Description |
---|---|
\W | Matches any character that is not a letter, digit, or underscore (in Java, \w is ASCII-only by default, so Chinese characters also count as \W unless Pattern.UNICODE_CHARACTER_CLASS is set) |
\S | Matches any character that is not whitespace |
\D | Matches any non-digit character |
\B | Matches a position that is not at the beginning or end of a word |
[^x] | Matches any character except x |
[^aeiou] | Matches any character except the letters a, e, i, o, u |
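A quick way to get a feel for the negated classes is to delete everything they match and see what is left. This is a sketch with invented helper names; the last two helpers assume Java’s default behavior, where \w and \W are ASCII-based unless the UNICODE_CHARACTER_CLASS flag is passed.

```java
import java.util.regex.Pattern;

final class NegationDemo {
    // Removes every non-digit, leaving only digits.
    static String keepDigits(String s) {
        return s.replaceAll("\\D", "");
    }

    // Removes every character that is not one of a, e, i, o, u.
    static String keepVowels(String s) {
        return s.replaceAll("[^aeiou]", "");
    }

    // Default mode: \W is everything outside [a-zA-Z0-9_].
    static String keepWordCharsAscii(String s) {
        return s.replaceAll("\\W", "");
    }

    // With UNICODE_CHARACTER_CLASS, letters from other scripts count as word characters.
    static String keepWordCharsUnicode(String s) {
        return Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepDigits("a1b22c333"));   // 122333
        System.out.println(keepVowels("hello world")); // eoo
        String mixed = "\u4e2d" + "-a";                // U+4E2D (a Chinese character), "-", "a"
        System.out.println(keepWordCharsAscii(mixed).length());   // 1 (only "a" is kept)
        System.out.println(keepWordCharsUnicode(mixed).length()); // 2 (the Chinese character and "a")
    }
}
```

The last two lines show why the description of \W depends on the engine and its flags.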
That’s it for advanced regex. Regular expressions are a broad and deep language. Learning some of the syntax is not too hard, but truly mastering them and writing really slick expressions is a long road. If you are genuinely interested and keep studying and using them, you will gradually come to appreciate their depth. I’ve taken you this far; the rest is up to you.