Now that we understand what regular expressions do and the basic rules, we can begin to read regular expressions.

Many people find other people’s regular expressions confusing at first; I did too. In fact, the essence of reading and writing regular expressions lies in “splitting” and “combining”, much like alchemy. Because a regular expression is composed of a series of elements combined in a particular way, I have divided those elements into the following six categories according to their characteristics:

  1. Metacharacters: quantifiers
  2. Metacharacters: special characters
  3. Escapes
  4. Pattern modifiers
  5. Capturing subgroups and references
  6. Non-capturing subgroups: once-only subgroups and assertions

This classification is only an aid to understanding. Regular expressions are a tool to be used, not something to memorize. We need to understand how they are put together, and then practice breaking them down and combining them.

Metacharacters: quantifiers (repetition)

Let’s start by recognizing some of the basic elements of regular expressions. Some characters in a regular expression are given special meanings and no longer simply stand for themselves; these are called metacharacters. They can be thought of as special codes within the expression. We begin with the most common ones, the quantifiers.

A quantifier expresses repetition, and it applies only to the item immediately preceding it. That item may be a single character or a whole subgroup. This is important, and beginners often get it wrong.
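The point about what a quantifier applies to can be checked with a small sketch. It uses Python's `re` module purely as an illustration (an assumption; the article itself targets PCRE), and the variable names are made up for the example.

```python
import re

# "ab+": the + applies only to the single character "b".
single_char = re.fullmatch(r"ab+", "abbb")

# "(ab)+": the + applies to the whole subgroup "ab".
subgroup = re.fullmatch(r"(ab)+", "ababab")

# "ab+" cannot match repeated "ab" pairs, because only "b" is repeated.
no_match = re.fullmatch(r"ab+", "ababab")

print(single_char is not None)  # True
print(subgroup is not None)     # True
print(no_match is None)         # True
```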

Metacharacter  Meaning
*  Repeats the preceding item 0 to n times. Quantifiers are greedy by default: they match as many characters as possible, up to the maximum allowed, without causing the rest of the pattern to fail.
+  Repeats the preceding item 1 to n times.
   It has one other use: a + placed after a quantifier makes that quantifier possessive. A possessive quantifier consumes as many characters as it can and never gives any back, regardless of what follows it in the pattern. Because the engine does not backtrack into it, this can speed up matching. It behaves much like a once-only subgroup.
   For example, .*abc matches "aabc", but .*+abc does not, because .*+ consumes the entire string and leaves nothing for the rest of the pattern.
Metacharacter  Meaning
{}  A custom quantifier, such as {5,10}. Both values must be less than 65536, the first number cannot be omitted, and the first number must be less than or equal to the second. If the second number is omitted but the comma is kept, as in {5,}, there is no upper limit. If both the second number and the comma are omitted, as in {5}, the quantifier specifies an exact number of repetitions.
    For example, [aeiou]{3,} matches at least three consecutive vowels.
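As a rough illustration of the three forms of the {} quantifier, here is a minimal sketch in Python's `re` module (used here as a stand-in for PCRE, which is an assumption of this example). The sample text is invented.

```python
import re

text = "queueing is beautiful"

# {3,}: at least three consecutive vowels, no upper limit.
at_least_three = re.findall(r"[aeiou]{3,}", text)

# {3}: exactly three vowels per match.
exactly_three = re.findall(r"[aeiou]{3}", text)

# {2,3}: two or three vowels, taking three whenever possible (greedy).
two_to_three = re.findall(r"[aeiou]{2,3}", text)

print(at_least_three)  # ['ueuei', 'eau']
print(exactly_three)   # ['ueu', 'eau']
print(two_to_three)    # ['ueu', 'ei', 'eau']
```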
Metacharacter  Meaning
?  As a quantifier, matches the preceding item zero or one time.
   It can also be placed after another quantifier to turn off its greediness, so that it matches as few characters as possible instead of as many as possible.
   The PHP official documentation has a good example. The pattern /\*.*\*/ is meant to match C-style comments, but applied to the string
   /* first comment */ not comment /* second comment */
   it matches the entire string, because .* is greedy. If the pattern is changed to /\*.*?\*/, each match covers a single comment, the first being /* first comment */.
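The comment example can be reproduced with a short sketch. Python's `re` module is used here only for illustration (an assumption; in PHP the same patterns would be wrapped in delimiters), and in Python the forward slash needs no escaping.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the end of the string and backs off only as far
# as the last "*/", so the single match covers everything.
greedy = re.findall(r"/\*.*\*/", source)

# Lazy: .*? stops at the first "*/" it can reach, so each comment
# becomes its own match.
lazy = re.findall(r"/\*.*?\*/", source)

print(greedy)  # ['/* first comment */ not comment /* second comment */']
print(lazy)    # ['/* first comment */', '/* second comment */']
```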

To fill in a gap from the previous section, where I mentioned greed in a broad sense, here are two examples:

  1. Matching ‘a3’ with the expression ‘a*?3’: because a*? is lazy, you might expect only the digit 3 to be matched, but the match is ‘a3’;

  2. Matching ‘133456789654a’ with ‘\d+?a’: even though the quantifier is lazy, the entire string is matched, because \d+? keeps expanding until the rest of the pattern can match;

This is greed in the broad sense: the engine starts at the leftmost possible position and does whatever is needed for the expression as a whole to match. It is always in effect; a lazy quantifier only asks for the shortest repetition that still lets the whole pattern succeed from that starting position.
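Both examples can be verified with a few lines. This is a sketch in Python's `re` module (an assumption made for illustration; the lazy-quantifier behavior shown is the same in PCRE), with a third case added for contrast.

```python
import re

# The match starts at the leftmost position, so the lazy a*? still has
# to take the "a" for the overall match to succeed there.
m1 = re.search(r"a*?3", "a3")

# \d+? expands one digit at a time until "a" can finally match.
m2 = re.search(r"\d+?a", "133456789654a")

# Contrast: with nothing required after it, the lazy quantifier
# really does stop after a single digit.
m3 = re.search(r"\d+?", "1334")

print(m1.group())  # a3
print(m2.group())  # 133456789654a
print(m3.group())  # 1
```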

One more tip: it is possible to write an apparently infinite loop by giving a quantifier with no upper limit to a subpattern that can match no characters, for example (a?)*. PCRE accepts such patterns and forcibly breaks the loop if a repetition of the subpattern matches no characters.

Metacharacters: special characters

The following metacharacters are commonly used:

Metacharacter  Meaning
[]  A character class. Note that it defines a set of characters, and the order of the characters inside it does not matter. It matches exactly one character in the target string, not a sequence. For example, [.?!] matches a single punctuation mark.
    I introduce it first because the same metacharacter can mean different things inside and outside a character class []. Most metacharacters lose their special meaning inside [], except for the following three:
    ^  Negation, but only when it is the first character of the class. For example, [^()] matches any character other than a parenthesis, and [^aeiou] matches any character that is not a lowercase vowel.
    -  A character range. For example, in case-insensitive mode [W-c] is equivalent to [][\^_`wxyzabc] matched caselessly.
    \  Escape, the same as outside []. Some escape sequences can also be used inside [], such as \d, \D, \w and \W.
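The behavior of the three metacharacters that stay special inside [] can be sketched as follows. Python's `re` module is assumed here as a convenient stand-in for PCRE, and the sample strings are invented.

```python
import re

# "." and "?" are literal inside a class; one character per match.
marks = re.findall(r"[.?!]", "Really? Yes. Go!")

# ^ as the first character negates the class.
non_vowels = re.findall(r"[^aeiou]", "banana")

# - between two characters denotes a range.
hex_chars = re.findall(r"[0-9a-f]", "x1fz")

# \d keeps its meaning inside a class; a trailing - is literal.
digits_or_dash = re.findall(r"[\d-]", "4-2=2")

# A range containing letters matches either case in caseless mode.
caseless_range = [c for c in "xBd" if re.fullmatch(r"[W-c]", c, re.I)]

print(marks)           # ['?', '.', '!']
print(non_vowels)      # ['b', 'n', 'n']
print(hex_chars)       # ['1', 'f']
print(digits_or_dash)  # ['4', '-', '2', '2']
print(caseless_range)  # ['x', 'B']
```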

Next, the special metacharacters outside a character class []:

Metacharacter  Meaning
\  The backslash is mostly used for escaping; it is covered in the escape section below.
^  Start of the subject, used as a position anchor (or start of a line in multi-line mode).
$  End of the subject, used as a position anchor (or end of a line in multi-line mode).
.  Period: matches any character except a newline (by default).
|  Separates alternatives: the patterns on its left and right are parallel alternative branches.
   Three points to note about alternatives:
   An alternative may be empty, in which case it matches the empty string.
   The matching process tries each alternative from left to right and uses the first one that succeeds, so pay attention to the order of the branches.
   The two branches are everything before the | and everything after it, not just the adjacent characters. When | is used inside a subgroup (), the branches are everything in that subgroup before the | and everything in it after the |.
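A short sketch of the anchors, the dot, and the rules for alternatives. It assumes Python's `re` module as the engine (an illustration only; `re.M` plays the role of the multi-line mode mentioned above).

```python
import re

text = "first line\nsecond line"

# Without multi-line mode, ^ matches only at the very start.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.M)

# "." stops at a newline by default.
dot_default = re.match(r".*", text).group()

# Alternatives are tried left to right; the first that succeeds wins.
first_branch = re.search(r"a|ab", "ab").group()
reordered = re.search(r"ab|a", "ab").group()

# Without a subgroup the branches are "gra" and "ey";
# inside a subgroup they are just "a" and "e".
grouped = re.fullmatch(r"gr(a|e)y", "grey")
ungrouped = re.fullmatch(r"gra|ey", "grey")

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(dot_default)       # first line
print(first_branch, reordered)                  # a ab
print(grouped is not None, ungrouped is None)   # True True
```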

Escapes

With the basic metacharacters introduced, a problem arises: what if I want to match a literal asterisk *, not the quantifier?

This is where the most common character in regular expressions comes in: the backslash \, the escape character. Understanding what it does is crucial for reading and writing regular expressions.

Character  Meaning
\  When followed by a non-alphanumeric character, removes any special meaning that character has. For example, \* means a literal * rather than a quantifier. This applies both inside and outside a character class []. To match a backslash itself, write \\.
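Escaping can be tried out with a minimal sketch like the one below. It assumes Python's `re` module; `re.escape` is Python's own helper for escaping a whole string and is not something the article mentions.

```python
import re

# "\*" is a literal asterisk, not a quantifier.
stars = re.findall(r"\*", "2*3*4")

# A literal backslash is written as "\\" in the pattern.
backslashes = re.findall(r"\\", r"C:\temp\file")

# Unescaped, the + in "1+1=2" is a quantifier, so the literal text
# "1+1=2" is not matched.
unescaped = re.search(r"1+1=2", "so 1+1=2 holds")

# re.escape (a Python helper) adds the needed backslashes.
escaped = re.search(re.escape("1+1=2"), "so 1+1=2 holds")

print(stars)             # ['*', '*']
print(len(backslashes))  # 2
print(unescaped is None, escaped is not None)  # True True
```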

In addition to restoring the literal meaning of a metacharacter, the backslash is also used to introduce generic character types.

Each pair of escape sequences below divides the complete character set into two disjoint parts, so no character can match both members of a pair. The lowercase sequence matches the type, and the uppercase sequence matches everything that is not of that type.

Character  Meaning
\d  (digit) Any decimal digit.
\D  Any character that is not a decimal digit.
\s  (space) Any whitespace character; lowercase s.
\S  Any character that is not whitespace; uppercase S.
\w  (word) Any word character. Note: a word character is any letter, digit, or underscore.
\W  Any non-word character. [^\W\d_] is often used to match letters only.
\b  (boundary) A word boundary. For example, the expression "\bweb\b" requires a boundary on both sides, so it matches only the standalone word “web”, not “webbing” or “cobweb”.
\B  Not a word boundary. Note that when the target string contains several words and spaces, a word boundary is not the same thing as the boundary of the string.
    For example, for "Tom is a cat", the expression ^(\w+|\s)*$ is anchored to the start and end of the string, whereas \b.*\b is anchored only to word boundaries: it matches from the boundary before "Tom" to the last word boundary, so any non-word characters at either end of the target (a leading space or a trailing period, say) are left out of the match.
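The character types and the word-boundary assertions can be explored with a sketch like this one. Python's `re` module is assumed as the engine, and the padded " Tom is a cat. " string is an invented variation used to make the difference between word boundaries and string boundaries visible.

```python
import re

# [^\W\d_]: word characters minus digits and underscore, i.e. letters.
letters_only = re.findall(r"[^\W\d_]+", "abc_123 def")

# \bweb\b matches only the standalone word.
whole_word = re.findall(r"\bweb\b", "web webbing cobweb")

# \Bweb requires that "web" does NOT start at a word boundary,
# which is true only inside "cobweb" (index 15).
inner_start = re.search(r"\Bweb", "web webbing cobweb").start()

# Anchored to the string: the whole target must fit the pattern.
whole_string = re.fullmatch(r"(\w+|\s)*", "Tom is a cat")

# Anchored to word boundaries: leading space and trailing ". " are left out.
between_boundaries = re.search(r"\b.*\b", " Tom is a cat. ").group()

print(letters_only)              # ['abc', 'def']
print(whole_word)                # ['web']
print(inner_start)               # 15
print(whole_string is not None)  # True
print(between_boundaries)        # Tom is a cat
```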

These sequences of character classes can occur either inside or outside the brackets []. They match one character at a time from the character type they represent.

Here are some special escape characters that are often used:

Character  Meaning
\A  Start of the subject.
\Z  End of the subject, or just before a newline at the end.
\z  End of the subject.
Note: the \A, \Z and \z assertions differ from ^ and $ in that they always match only at the very start and end of the subject, whatever pattern modifiers are set. They are position assertions, so they cannot be used inside a character class [].
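The difference between ^ and \A under multi-line mode can be seen in a small sketch. It assumes Python's `re` module, with one caveat: Python's naming differs from PCRE at the end of the subject. Python's `\Z` matches only at the very end (like PCRE's \z), while Python's `$` without flags also matches before a final newline (like PCRE's \Z).

```python
import re

text = "one\ntwo\n"

# In multi-line mode ^ matches at the start of every line...
caret = re.findall(r"^\w+", text, re.M)

# ...but \A still matches only at the very start of the subject.
subject_start = re.findall(r"\A\w+", text, re.M)

# Python's $ (no flags) accepts a trailing newline after "two".
before_final_newline = re.search(r"two$", text)

# Python's \Z demands the very end, so the trailing newline blocks it.
very_end = re.search(r"two\Z", text)

print(caret)          # ['one', 'two']
print(subject_start)  # ['one']
print(before_final_newline is not None, very_end is None)  # True True
```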

Regular expressions also have escape sequences for non-printing characters in strings. \n, \r and \t can also be used inside a character class [], but \R is not recognized as a newline sequence there:

Character  Meaning
\n  Newline.
\r  Carriage return.
\R  Any newline sequence: matches \n, \r or \r\n. ‘/^\R$/’ matches a subject consisting of a single newline sequence.
\t  Horizontal tab.
[\b]  Backspace; written inside [] to avoid a conflict with the word boundary \b.

Pattern modifiers

In the first section, on the basic rules, we mentioned that matching is case-sensitive by default, that the subject is treated as a single line by default, and that quantifiers are greedy by default. So what if you want to ignore case, match line by line, or change the greedy behavior?

This is where pattern modifiers come in. They are a special group of characters that adjust how the whole pattern is matched. Here are some of the more common ones:

Modifier  Internal option  Meaning
i  PCRE_CASELESS  Case-insensitive matching. The default is case-sensitive.
m  PCRE_MULTILINE  Multi-line matching. By default the subject is treated as a single line. When this modifier is set, “start of line” and “end of line” match immediately after and immediately before any newline in the subject, respectively, as well as at the very start and end.
s  PCRE_DOTALL  Makes the dot metacharacter match any character, including a newline.
x  PCRE_EXTENDED  Whitespace characters in the pattern are ignored unless they are escaped or inside a character class [], which makes complicated patterns easier to lay out and read.
U  PCRE_UNGREEDY  Inverts greediness: all quantifiers become non-greedy by default, and an individual quantifier can be made greedy again by following it with ?. In other words, the U modifier reverses the default greedy behavior.
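Four of these modifiers have direct counterparts in Python's `re` module, which is used below as an illustrative stand-in (an assumption): `re.I`, `re.M`, `re.S` and `re.X` correspond to i, m, s and x. Python has no counterpart to U, so it is not shown.

```python
import re

# i: case-insensitive matching.
caseless = re.search(r"cat", "CAT", re.I)

# m: ^ and $ match at every line, not just the ends of the subject.
per_line = re.findall(r"^\d+$", "1\n22\n333", re.M)

# s: the dot also matches a newline.
dot_all = re.match(r".*", "a\nb", re.S).group()

# x: unescaped whitespace in the pattern is ignored, and "#" starts
# a comment, so the pattern can be laid out over several lines.
year_month = re.compile(r"""
    \d{4}   # year
    -
    \d{2}   # month
""", re.X)
extended = year_month.fullmatch("2024-05")

print(caseless is not None)  # True
print(per_line)              # ['1', '22', '333']
print(repr(dot_all))         # 'a\nb'
print(extended is not None)  # True
```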

Pattern modifiers can be applied in three ways:

Whole expression: put the modifier letters after the closing delimiter. This affects the matching of the entire expression.
Main expression: write (?modifier) inside the pattern. This affects the rest of the main expression.
Subgroup: write (?modifier) inside the subgroup. This affects the rest of that subgroup.
For example, in '/cat[aeiou]/i' the i after the closing delimiter makes the entire expression case-insensitive. In '/cat(?i)[aeiou]/', only the part after (?i), that is [aeiou], is case-insensitive. '/(a(?i)bc|d)/' matches "abc", "aBC", "d" and "D": a change made in one branch carries over into the following alternative branches of the same subgroup.

Modifiers can be combined, for example (?im) turns on both multi-line and case-insensitive matching. A modifier can also be turned off by preceding it with a hyphen: (?im-sx) sets PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED.
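Inline modifiers can be sketched in Python's `re` module as well, with one important difference that makes this an approximation rather than a PCRE example: Python accepts a bare (?i) only at the very start of a pattern, so the scoped form (?i:...) is used below as the nearest equivalent of a modifier placed partway through.

```python
import re

# (?i) at the start applies to the whole pattern.
whole_pattern = re.search(r"(?i)cat[aeiou]", "CATA")

# Modifiers can be combined: (?im) is caseless plus multi-line.
combined = re.findall(r"(?im)^tom$", "Tom\nTOM\ntim")

# Scoped form: only the character class is case-insensitive,
# so "cat" itself must still be lowercase.
scoped_ok = re.fullmatch(r"cat(?i:[aeiou])", "catA")
scoped_fail = re.fullmatch(r"cat(?i:[aeiou])", "CATa")

print(whole_pattern is not None)  # True
print(combined)                   # ['Tom', 'TOM']
print(scoped_ok is not None, scoped_fail is None)  # True True
```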

To summarize, in this section we looked at some of the basic elements of regular expressions: the quantifier metacharacters and their greedy nature; the special metacharacters, especially the difference between their meaning inside and outside a character class []; the escape character and some special escape sequences; and finally how to change the default behavior of a pattern with pattern modifiers.

Writing this summary took real effort; please do not repost it without permission.

Resources: www.php.net official documentation