This is the second day of my participation in Gwen Challenge
Constituent elements and methods
Picking up where we left off
Using the basic elements described in the previous section, you can now build some fixed-structure expressions yourself. For example, to match an IP address such as 12.1.242.1, we can combine metacharacters and quantifiers: ^[0-2]?\d?\d\.[0-2]?\d?\d\.[0-2]?\d?\d\.[0-2]?\d?\d$
But 299.299.299.299 also matches, which is obviously outside the valid IP address range, and the expression isn’t very neat either: the same fragment is repeated four times. So how do we make a regex more flexible, more concise, and more precise? That is the question for this section.
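To see the problem concretely, here is a minimal sketch using Python’s built-in `re` module. The choice of Python is an assumption, since the article does not name a host language, and the names `NAIVE_IP` and `looks_like_ip` are only for illustration.

```python
import re

# The first-attempt pattern from the text (no spaces inside the pattern).
NAIVE_IP = re.compile(r"^[0-2]?\d?\d\.[0-2]?\d?\d\.[0-2]?\d?\d\.[0-2]?\d?\d$")


def looks_like_ip(text: str) -> bool:
    """Return True if the naive pattern accepts the text."""
    return NAIVE_IP.match(text) is not None


print(looks_like_ip("12.1.242.1"))       # True
print(looks_like_ip("299.299.299.299"))  # True, although 299 is out of range
print(looks_like_ip("300.1.1.1"))        # False, a leading digit must be 0-2
```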
Subgroups ()
Parentheses () can be nested, but nesting doesn’t change the order of matching, which is still left to right; in this they differ from parentheses in arithmetic.
Subgroups are very powerful. By my count they have six main uses:
- Combining several elements into one unit that a quantifier applies to. For example, \w\d{5} and (\w\d){5} match different things: the former matches 1 word character followed by 5 digits (6 characters in total), while the latter matches 5 pairs of a word character and a digit (10 characters in total).
- Localizing the alternation |. By default an alternation splits the whole expression: (cat)(dog)|(ant) matches catdog or ant. Inside a subgroup, as in (cat)(dog|ant), which matches catdog or catant, the alternation is confined to the contents of that subgroup.
- Capturing part of the match separately. Without subgroups, a regex returns only the whole matched string; with subgroups, it returns the whole match and also what each subgroup matched, which is very convenient for data extraction.
- Backreferences, which refer directly to the text captured by an earlier subgroup.
- Choosing an alternation branch according to a condition; this is what conditional subgroups are for.
- Comments written directly inside the regular expression, in the form (?# followed by the comment text).
There’s nothing more to say about the first and sixth uses, so I’ll focus on the second through fifth.
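The first and sixth uses are still easy to check by hand. The following is a small sketch, assuming Python’s built-in `re` module (the article does not name a host language); the variable names are only for illustration.

```python
import re

pairs = "a1b2c3d4e5"

# \w\d{5}: the quantifier binds only to \d, so one word character plus five digits.
one_plus_five = re.fullmatch(r"\w\d{5}", "a12345") is not None
not_pairs = re.fullmatch(r"\w\d{5}", pairs) is not None

# (\w\d){5}: the quantifier binds to the whole group, so five character+digit pairs.
five_pairs = re.fullmatch(r"(\w\d){5}", pairs) is not None

# (?#...) is an inline comment and does not change what is matched.
with_comment = re.fullmatch(r"(\w\d){5}(?#five pairs)", pairs) is not None

print(one_plus_five, not_pairs, five_pairs, with_comment)  # True False True True
```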
Localizing the alternation |
Here is an example that I once got wrong: (a|b)*
When I first started learning regex, I read it as (a*|b*). Later I found that splitting it up this way is a problem: (a*|b*) matches a run of a’s such as aaaa, or a run of b’s such as bbbbb, but it cannot match a mixture of a and b such as aabb or abababab.
The correct reading is that (a|b) is treated as a unit, so (a|b)* is equivalent to (a|b)(a|b)(a|b)… and can match any combination of a and b. This is worth understanding well before we get to recursive regex.
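Both points about alternation, its scope and how it combines with a quantifier, can be checked with a short sketch. It assumes Python’s built-in `re` module, and the helper `full` is a hypothetical name introduced here.

```python
import re


def full(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# Top-level alternation splits the whole expression: (cat)(dog) OR (ant).
print(full(r"(cat)(dog)|(ant)", "catdog"))  # True
print(full(r"(cat)(dog)|(ant)", "ant"))     # True
print(full(r"(cat)(dog)|(ant)", "catant"))  # False

# Inside a group the alternation is local: cat, then dog OR ant.
print(full(r"(cat)(dog|ant)", "catant"))    # True
print(full(r"(cat)(dog|ant)", "ant"))       # False

# (a*|b*) is a run of a's OR a run of b's; (a|b)* makes the choice again each time.
print(full(r"(a*|b*)", "aaaa"))             # True
print(full(r"(a*|b*)", "aabb"))             # False
print(full(r"(a|b)*", "abababab"))          # True
```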
Capturing subgroups and references \g{n}
A capture happens when a subgroup matches successfully. The purpose of capturing is to allow references, and the purpose of references is to keep the structure simple and avoid repetition.
Subgroups capture by default: when the match succeeds, the value matched by each subgroup (its captured value) is returned too. Subgroups are numbered from left to right, starting at 1.
But when a subgroup is repeated, the value captured for that subgroup is the one from its last iteration.
Misunderstanding this is an easy way to make mistakes.
For example, \w*(\d{3}) matched against abc123456 returns the entire string plus 456 for the subgroup (\d{3}), whereas \w*(\d){3} returns the entire string plus 6.
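The two captures can be compared directly. This is a sketch assuming Python’s built-in `re` module, where `group(0)` is the whole match and `group(n)` is the nth subgroup; the last example is an extra illustration of left-to-right numbering, which follows the opening parentheses.

```python
import re

# The group contains all three digits, so it captures "456".
m1 = re.match(r"\w*(\d{3})", "abc123456")
print(m1.group(0), m1.group(1))  # abc123456 456

# The group is repeated three times; only the last iteration is kept.
m2 = re.match(r"\w*(\d){3}", "abc123456")
print(m2.group(0), m2.group(1))  # abc123456 6

# Groups are numbered by the position of their opening parenthesis.
m3 = re.match(r"((a)(b))", "ab")
print(m3.groups())  # ('ab', 'a', 'b')
```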
OK, now for references. A reference captures a value and then reuses it. In regex, references point backward, meaning that the referenced subgroup must appear before the reference.
A reference is written \n, \gn or \g{n}, where n is the number of the subgroup. I recommend \g{n} because it is the clearest.
n can be positive, 0 or negative. A positive n means the nth subgroup counted from the start of the expression. A negative n means the nth subgroup counted backward from the \g{n} itself. For example, (foo)(bar)\g{-1} matches the string “foobarbar”, and (foo)(bar)\g{-2} matches “foobarfoo”.
If n=0, \g{0} refers to the expression itself, which we’ll come back to when we talk about recursive regex.
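Here is a rough equivalent of the two relative references, assuming Python’s built-in `re` module. `re` does not accept \g{n} inside a pattern, so the sketch uses the plain \n form with the absolute numbers the relative ones resolve to.

```python
import re

# (foo)(bar)\g{-1}: one group back from the reference is group 2.
last_group = re.fullmatch(r"(foo)(bar)\2", "foobarbar") is not None

# (foo)(bar)\g{-2}: two groups back from the reference is group 1.
first_group = re.fullmatch(r"(foo)(bar)\1", "foobarfoo") is not None

# The reference repeats what group 1 captured ("foo"), so "bar" does not fit.
wrong_text = re.fullmatch(r"(foo)(bar)\1", "foobarbar") is not None

print(last_group, first_group, wrong_text)  # True True False
```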
There are 4 points to note when using references:
- If a subgroup did not take part in the match, any backreference to that subgroup also fails.
- An error is reported if the expression does not contain enough capturing subgroups for the reference.
- A “non-capturing subgroup” earlier in the expression is skipped and not counted in the numbering.
- A backreference refers to the text matched by the earlier subgroup, not to its pattern, as the following two examples show:
(sens|respons)e and \g{1}ibility matches “sense and sensibility” and “response and responsibility”, but not “sense and responsibility”.
((?i)rah)\s+\g{1} matches “rah rah” and “RAH RAH”, but not “RAH rah” or “rah RAH”.
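The four points can be tried out one by one. This sketch assumes Python’s built-in `re` module (3.6 or later), so it writes \1 instead of \g{1}, and it uses the scoped form (?i:rah) because recent Python versions reject an inline (?i) that is not at the start of the pattern. The variable names are only for illustration.

```python
import re

# 1. A group that did not take part in the match makes its backreference fail.
took_part = re.fullmatch(r"(a)?b\1", "aba") is not None    # True
skipped = re.fullmatch(r"(a)?b\1", "b") is not None        # False

# 2. Referring to a group that does not exist is an error at compile time.
try:
    re.compile(r"(a)\2")
    missing_group_error = False
except re.error:
    missing_group_error = True

# 3. Non-capturing groups (?:...) are skipped when numbering, so (b) is group 1.
skips_noncapturing = re.fullmatch(r"(?:a)(b)\1", "abb") is not None  # True

# 4. The reference repeats the matched text, not the pattern.
sense = r"(sens|respons)e and \1ibility"
same_word = re.fullmatch(sense, "sense and sensibility") is not None      # True
mixed_words = re.fullmatch(sense, "sense and responsibility") is not None  # False

rah = r"((?i:rah))\s+\1"
same_case = re.fullmatch(rah, "RAH RAH") is not None   # True
mixed_case = re.fullmatch(rah, "RAH rah") is not None  # False

print(took_part, skipped, missing_group_error, skips_noncapturing)
print(same_word, mixed_words, same_case, mixed_case)
```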
Let me give a more unusual example of a reference: a backreference inside an alternation branch. By nesting the reference directly inside a branch of the subgroup it refers to, a regex can match a recursive sequence, which is interesting!
The expression (a|b\g{1})+ is such a self-referencing case. Let’s take it apart:
- First round: only the string a can be matched, because \g{1} has no value yet. The subgroup it refers to, (a|b\g{1}), has not matched anything so far, so the branch b\g{1} cannot match.
- Second round: \g{1} now has the value a, so the expression can be read as a(a|b(a)), which matches aa or aba. If aa is matched, \g{1} stays a after every round and the expression matches any run of a’s. But if aba is matched, \g{1} is set to ba, the value captured in the second round. Then we can look at the third round.
- Third round: \g{1} is now ba, so the expression can be read as aba(a|b(ba)), which matches abaa or ababba. By now the rule should be visible: the expression can match a recursive sequence such as ababbabbbab…
Conditional subgroups
Conditional subgroups give us more flexibility: they match different patterns depending on whether a condition holds.
The form is (?(condition)yes-pattern|no-pattern). If the condition is met, yes-pattern is used; otherwise no-pattern is used. no-pattern can be omitted, in which case nothing is matched. An error is reported if the subgroup contains more than 2 alternatives.
The condition can take three different forms:
Form | Type | Meaning |
---|---|---|
(?(n)pattern) | Numeric reference | Uses subgroup n as the condition: the condition is true if that subgroup has matched. n is an integer, positive or negative, with the same meaning as in a backreference. |
(?(R)pattern) | Recursive reference | R refers to the entire expression. The condition is true if the expression has been called recursively; it is always false in the first round of matching, because no recursive call has been made yet. |
(?(assertion)pattern) | Assertion as condition | Any kind of assertion can be used: lookahead or lookbehind, positive or negative. We’ll talk about assertions in the next section. |
An example of a numeric condition: (\()?[^()]+(?(1)\)) matches a run of non-parenthesis characters, either bare or enclosed in a pair of parentheses. Broken down:
(\()? // Optionally matches an open parenthesis and, if present, captures it
[^()]+ // Matches one or more non-parenthesis characters
(?(1)\)) // A conditional subgroup that tests whether the first subgroup matched. If it did, that is, the target string starts with an open parenthesis, the condition is true and yes-pattern requires a close parenthesis. Otherwise, since there is no no-pattern, the subgroup matches nothing.
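The numeric condition can be tried as is, since Python’s built-in `re` module (assumed here as the host language) also supports the (?(1)...) form. The names `PAREN` and `balanced` are only for illustration.

```python
import re

# Optional "(", then non-parenthesis characters, then ")" only if "(" was captured.
PAREN = re.compile(r"(\()?[^()]+(?(1)\))")


def balanced(text: str) -> bool:
    """Return True if the whole text is bare or wrapped in one pair of parentheses."""
    return PAREN.fullmatch(text) is not None


print(balanced("abc"))    # True
print(balanced("(abc)"))  # True
print(balanced("(abc"))   # False, group 1 matched so ")" is required
print(balanced("abc)"))   # False, group 1 did not match so ")" is left over
```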
An example of a recursive condition: A(?(R)B)(?R)?C matches, for example, AC or AABCC.
In the first round, the character A is matched. The conditional subgroup (?(R)B) matches nothing, because no recursion has happened yet. (?R)? means zero or one recursive call, so we leave a placeholder for it. Finally the character C is matched, so the first round gives A_C.
In the second round, the whole expression is called recursively. It first matches the character A again. This time the condition in (?(R)B) is true, so B is matched. We stop recursing here, so (?R)? matches nothing and C is matched directly. Inserting the second round’s ABC into the placeholder gives AABCC.
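Python’s built-in `re` module has no (?R), so the following sketch does not run the regex at all. It hand-writes a tiny recursive matcher that mirrors the pieces of A(?(R)B)(?R)?C, purely to make the rounds above traceable; `match_at` and `matches` are hypothetical helper names, not part of any library.

```python
from typing import Optional


def match_at(s: str, pos: int = 0, recursed: bool = False) -> Optional[int]:
    """Mimic A(?(R)B)(?R)?C from index pos; return the end index or None."""
    if not s.startswith("A", pos):       # A
        return None
    pos += 1
    if recursed:                         # (?(R)B): B is required only inside a recursion
        if not s.startswith("B", pos):
            return None
        pos += 1
    inner = match_at(s, pos, True)       # (?R)?: first try one recursive call
    if inner is not None and s.startswith("C", inner):
        return inner + 1                 # C after the recursive part
    if s.startswith("C", pos):           # otherwise (?R)? matches nothing, then C
        return pos + 1
    return None


def matches(s: str) -> bool:
    """Return True if the whole string fits the pattern."""
    return match_at(s) == len(s)


print(matches("AC"))     # True
print(matches("AABCC"))  # True
print(matches("ABC"))    # False, B is not allowed at the top level
```

Because (?R)? is optional at every level, deeper nestings such as AABABCCC are accepted by this sketch as well.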
OK, there is a bit more to say about subgroups, so I’m splitting it into two parts. To sum up this one: we learned what subgroups are and what they do, we learned about capturing and backreferences and what to watch out for when using them, and finally we covered conditional subgroups and touched on recursive references.
Let’s go back to the original question of how to match any IP address with a regex, and apply what we learned today.
- Step 1: analyze the basic structure of the target. IP addresses range from 0.0.0.0 to 255.255.255.255. With what we now know about references, we only need to solve the first field and then reference it for the following three fields, without repeating it.
- Step 2: set limits. Each field has to be limited to 0-255, otherwise we’re back to the problem we started with. We use the subgroup + alternation structure. First the range 200-255, which can be written as
2(5[0-5]|[0-4]\d)
- Now the range 0-199. If we simply write it as
[01]?\d?\d
then 299 will match as 99, so part of a non-compliant value is extracted as if it were compliant. What we want is for non-compliant data to fail to match entirely. So we need stronger restrictions on the number of digits and on the range.
- Step 3: constrain the number of digits and the range. Start with the 3-digit case: the range 100-199 is covered by 1\d\d. The range 10-99 is covered by [1-9]\d, and a single digit by \d. This is still not enough; we also need to pin the field down on both sides, with ^ in front and the literal dot \. behind:
^(1\d\d|[1-9]\d|\d)\.
With these two anchors and a fixed number of digits per branch, we limit both the digit count and the range, and values outside the range are not matched at all.
- One special case to note is values such as 000 and 010, which start with 0; they are matched by 0\d{2}.
- Step 4: combine. We set up five alternation branches, putting the most complex branch first:
^(2(5[0-5]|[0-4]\d)|1\d{2}|0\d{2}|[1-9]\d|\d)\.
The next three fields are recursive calls to the first subgroup, which gives the final result:
^(2(5[0-5]|[0-4]\d)|1\d{2}|0\d{2}|[1-9]\d|\d)(\.(?1)){3}$
- Explanation: I’m using a recursive reference, (?1), here, which refers to the pattern of the first subgroup. A backreference would be wrong, because a backreference refers to the value the subgroup captured, so the following fields would have to repeat the first field exactly. In addition, (\.(?1)){3} groups the dot and the reference together and repeats them three times, using the combination of subgroups and quantifiers mentioned at the beginning of this section.
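As a rough check of the final expression, here is a sketch assuming Python’s built-in `re` module. `re` has no (?1), so the field pattern is kept in a string (the hypothetical name `OCTET`) and pasted in again, which reuses the pattern rather than the captured text. The 200-255 branch is written as 2(5[0-5]|[0-4]\d) so that 200-249 are covered along with 250-255.

```python
import re

# One field, 0-255; three-digit values with a leading zero such as 010 are allowed.
OCTET = r"(?:2(?:5[0-5]|[0-4]\d)|1\d{2}|0\d{2}|[1-9]\d|\d)"

# Stand-in for (?1): the same sub-pattern is inserted again as text.
IP = re.compile(rf"^{OCTET}(?:\.{OCTET}){{3}}$")

# For contrast, a backreference repeats the captured text, not the pattern.
SAME_FIELDS = re.compile(rf"^({OCTET})(?:\.\1){{3}}$")


def is_ip(text: str) -> bool:
    """Return True if every field of the dotted address is within 0-255."""
    return IP.match(text) is not None


for text in ["12.1.242.1", "0.0.0.0", "255.255.255.255",
             "299.299.299.299", "256.1.1.1"]:
    print(text, is_ip(text))

print(SAME_FIELDS.match("7.7.7.7") is not None)     # True
print(SAME_FIELDS.match("12.1.242.1") is not None)  # False
```

The last two lines show why a backreference does not work here: it only accepts addresses whose four fields are identical.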
I will cover recursion in detail in a later post. If you’ve read this far without giving up, the content to come will be even more exciting!
Learning is not easy. Please do not reprint without permission, otherwise don’t blame this old man for being impolite.