Definition

A regular expression is a formula that uses a pattern to match a class of strings. It is mainly used as a tool for describing and matching strings.

Text or a string may contain more than one part that satisfies a given regular expression; each such part is called a match. The word "match" is used in three senses:

  1. As a description: a string matches a regular expression.
  2. As an action: matching a regular expression against text or a string.
  3. As a noun: a match is the part of a string that satisfies a given regular expression.

Metacharacters

Metacharacters are a special class of characters that can match either a position or a character from a character set. They can be divided into two types:

  • Metacharacters that match a position
  • Metacharacters that match characters

A metacharacter matches only a single character or a single position; that is, it works in units of one character, not a whole string.

Metacharacters that match a position

Character    Description
^            Matches the start of a line
$            Matches the end of a line
\b           Matches the start or end of a word (does not work for Chinese text)

Test

  • ^a

Matches an a at the start of a line

  • a$

Matches an a at the end of a line

  • ^a$

Matches a line that consists of only the letter a

  • \bStr

Matches Str at the beginning of a word

  • ing\b

Matches ing at the end of a word

  • \bString\b

Matches only the whole word String

How does \b recognize what counts as a word? Strings separated by punctuation marks or spaces are recognized as words. \b can only be used with English text, not Chinese.
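As a rough illustration of the position metacharacters, here is a minimal sketch using Python's re module. The sample text and variable names are invented for this example. In Python, the re.MULTILINE flag is needed for ^ and $ to match at every line rather than only at the ends of the whole text.

```python
import re

# Hypothetical sample text: four lines.
text = "a\nabc\ncba\nString Strong testing"

# re.MULTILINE lets ^ and $ match at the start and end of each line.
starts_with_a = re.findall(r"^a", text, re.MULTILINE)
ends_with_a = re.findall(r"a$", text, re.MULTILINE)
only_a = re.findall(r"^a$", text, re.MULTILINE)

# \b is the boundary between a word character and a non-word character.
# \w* is added here only to capture the rest of the word for display.
words_starting_with_str = re.findall(r"\bStr\w*", text)
words_ending_with_ing = re.findall(r"\w*ing\b", text)
exact_word = re.findall(r"\bString\b", text)

print(starts_with_a, ends_with_a, only_a)
print(words_starting_with_str, words_ending_with_ing, exact_word)
```

Only the first line satisfies ^a$, because it is the only line made up of a single a.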

Metacharacters that match characters

Each of these metacharacters matches a single character

Character    Description
. (dot)      Matches any character except a newline
\w           Matches any word character (letters, digits, underscore)
\W           Matches any non-word character
\s           Matches any whitespace character (space, tab, line feed, full-width space)
\S           Matches any non-whitespace character
\d           Matches any digit from 0 to 9
\D           Matches any non-digit character

Test

  • .

Every character is matched (except newlines)

  • \w

All word characters are matched; punctuation marks (other than the underscore) and Chinese characters are excluded

  • \W

The result is the opposite of \w. Note that the underscore is a word character

  • \s

The two spaces are matched. Note that six characters are matched in total: the two spaces plus the newline at the end of each of lines 1 to 4

  • \S

Everything except the two spaces and the four newlines is matched

  • \d

Matches all the digits

  • \D

Matches all characters other than digits
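The character metacharacters can be tried out with a short Python sketch. The sample text is made up. One assumption to note: Python's \w matches Unicode letters (including Chinese) by default, so the re.ASCII flag is passed here to reproduce the behavior described above, where Chinese characters are not word characters.

```python
import re

# Hypothetical sample: 12 characters, including one newline and two spaces.
text = "ab_1 中文!\n2 c"

any_char = re.findall(r".", text)               # everything except the newline
word_chars = re.findall(r"\w", text, re.ASCII)  # letters, digits, underscore
non_word = re.findall(r"\W", text, re.ASCII)    # the opposite of \w
whitespace = re.findall(r"\s", text)            # spaces and the newline
digits = re.findall(r"\d", text)
non_digits = re.findall(r"\D", text)

print(word_chars)   # ['a', 'b', '_', '1', '2', 'c']
print(whitespace)   # [' ', '\n', ' ']
print(digits)       # ['1', '2']
```

Without re.ASCII, the two Chinese characters would move from the \W result to the \w result.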

Combining metacharacters

Individual metacharacters can be freely combined to achieve different matching effects

  • \w\w

Matches two consecutive word characters

  • \w\s

Matches a word character followed by a whitespace character. Note that the final m on the third line is matched together with the newline character at the end of that line
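A small Python sketch of the two combinations above, using an invented two-line string. It shows that \s also consumes the newline at the end of a line.

```python
import re

# Hypothetical sample text: "I am" on the first line, "ok" on the second.
text = "I am\nok"

pairs = re.findall(r"\w\w", text)        # two consecutive word characters
word_then_space = re.findall(r"\w\s", text)  # word character + whitespace

print(pairs)            # ['am', 'ok']
print(word_then_space)  # ['I ', 'm\n']
```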

Text matching

Character classes

A character class is a set of characters; a match is found if any one character in the set matches.

  • []

A character class is written with square brackets. The characters are listed inside the brackets, and the class matches any one of them

  • -

When - is not the first character, it defines a range of characters. For example, [1-3] means [123], and [a-z] matches any lowercase letter. If - is the first character, it only represents itself. The order of a range, and the characters between its two endpoints, are determined by the order of the ASCII table. For example, [9-1] violates ASCII order, which results in an error

  • ^

If placed first, it negates the character class. [^123] matches any character except 1, 2, and 3

  • Most metacharacters, such as . and $, have no special meaning inside a character class and simply represent themselves

Test

  • [aeiou]

Matches vowels

  • [a-z]

Matches all letters from a to z

  • [^aeiou]

Matches all characters except vowels (including any symbols)

  • [^a-z]

Matches all characters except the lowercase letters a to z

  • [0-9]

Matches any digit between 0 and 9

  • [ao-u]

Matches the letter a and the letters from o to u. As long as a hyphen - sits between two characters, it is treated as a range

  • [!-?]

Matches the characters from ! to ?. Because a range is judged by ASCII table order, any two characters in that order can be joined with -, so this is perfectly valid

  • [a^-]

Matches the characters a, ^, and -. As long as ^ is not in first position and - is not between two characters, they represent themselves

  • a[no]

Matches the string an or ao; this combines a literal character with a character class
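The character class rules can be checked with a minimal Python sketch. The sample strings and variable names are assumptions for illustration. In Python's re module, a reversed range such as [9-1] raises re.error when the pattern is compiled.

```python
import re

# Hypothetical sample text.
text = "banana-cake^2!"

vowels = re.findall(r"[aeiou]", text)
not_lowercase = re.findall(r"[^a-z]", text)

# ^ is not first and - is not between two characters, so both are literal.
literals = re.findall(r"[a^-]", text)

# A literal a followed by either n or o.
an_or_ao = re.findall(r"a[no]", "an ao ax")

# A range written against ASCII order is rejected at compile time.
try:
    re.compile(r"[9-1]")
    reversed_range_is_error = False
except re.error:
    reversed_range_is_error = True

print(vowels, not_lowercase, literals, an_or_ao, reversed_range_is_error)
```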

Character escaping

The metacharacters introduced earlier are very useful, but they raise a question: what if we want to match the metacharacter itself in an ordinary expression? Does it have to go inside a character class every time? Of course not, which brings us to character escaping.

Regular expressions define special metacharacters such as ^, $, and the period. Because these characters have a special meaning in a regular expression, you need character escaping to match them literally. The escape character is \ (backslash), which cancels the special meaning that characters such as ^ and $ have in an expression.

  • \.

Matches the character .

  • \*

Matches the character *

  • \\

Matches the character \

  • www\.lanyuanxiaoyao\.com

Matches the string www.lanyuanxiaoyao.com; the dots in a URL also need to be escaped
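To see why the dots need escaping, here is a short Python sketch. The "wrong" input strings are invented for the example. re.escape is a standard-library helper that adds the backslashes automatically.

```python
import re

escaped = r"www\.lanyuanxiaoyao\.com"
unescaped = r"www.lanyuanxiaoyao.com"

# With escaping, only a literal dot is accepted.
exact_ok = re.fullmatch(escaped, "www.lanyuanxiaoyao.com") is not None
exact_rejects_x = re.fullmatch(escaped, "wwwxlanyuanxiaoyao.com") is None

# Without escaping, each dot matches any character, so this also matches.
loose_accepts_x = re.fullmatch(unescaped, "wwwxlanyuanxiaoyaoxcom") is not None

# A literal backslash is matched by an escaped backslash.
backslashes = re.findall(r"\\", "a\\b")

# re.escape builds the escaped pattern from a plain string.
auto_escaped = re.escape("a.b*c")

print(exact_ok, exact_rejects_x, loose_accepts_x, backslashes, auto_escaped)
```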

Common escape characters

Character or expression    Description
\a        Bell (alarm) character, \u0007
\b        A word boundary in a regular expression; inside a character class, a backspace character, \u0008
\t        Tab, \u0009
\r        Carriage return, \u000D
\v        Vertical tab, \u000B
\f        Form feed, \u000C
\n        Newline, \u000A
\e        Escape (ESC), \u001B
\040      Matches an ASCII character given as an octal number (up to 3 digits)
\x20      Matches an ASCII character given as a hexadecimal number
\cC       ASCII control character, such as Ctrl-C
\u0020    Matches a Unicode character given as a hexadecimal number (exactly 4 digits)
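A few of these escapes can be tried in Python, as sketched below. This is only a partial check: to my knowledge Python's re module does not accept \e or \cC, so those two are left out, and other engines may differ in which escapes they support.

```python
import re

# Three spellings of the same space character (U+0020):
# hexadecimal, octal, and Unicode.
space_patterns = [r"\x20", r"\040", r"\u0020"]
all_match_space = all(re.fullmatch(p, " ") for p in space_patterns)

tab_found = re.search(r"\t", "a\tb") is not None

# Outside a character class, \b is a word boundary.
boundary_hits = re.findall(r"\bcat\b", "cat concat")

# Inside a character class, \b is the backspace character (\u0008).
backspace_hits = re.findall(r"[\b]", "a\bb")

print(all_match_space, tab_found, boundary_hits, backspace_hits)
```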

Quantifiers

This is an important topic: quantifiers (also called qualifiers) specify how many times a particular character or character set may be repeated.

Character or expression    Description
{n}       Repeat exactly n times
{n,}      Repeat at least n times
{n,m}     Repeat at least n times and at most m times
*         Repeat at least 0 times, equivalent to {0,}
+         Repeat at least once, equivalent to {1,}
?         Repeat 0 or 1 times, equivalent to {0,1}
*?        Repeat as few times as possible
+?        Repeat as few times as possible, but at least once
??        Repeat 0 times if possible, otherwise once
{n}?      Equivalent to {n}
{n,}?     Repeat as few times as possible, but at least n times
{n,m}?    Repeat between n and m times, as few times as possible

Test

  • a{3}

  • a{2,}

  • a{2,3}

  • ab+

  • ab?

  • ab*

  • ab+?

  • ab??

  • ab*?
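The expressions in this list can be compared side by side with a small Python sketch. The helper first_match and the two sample strings are hypothetical names introduced for this example. Each result is the first match found in the sample.

```python
import re

def first_match(pattern: str, text: str) -> str:
    """Return the text of the first match, or an empty string if none."""
    match = re.search(pattern, text)
    return match.group() if match else ""

# Counted repetition, tried against four a's.
counted = {p: first_match(p, "aaaa") for p in (r"a{3}", r"a{2,}", r"a{2,3}")}

# Greedy and lazy forms, tried against one a followed by three b's.
repeated = {
    p: first_match(p, "abbb")
    for p in (r"ab+", r"ab?", r"ab*", r"ab+?", r"ab??", r"ab*?")
}

for pattern, result in {**counted, **repeated}.items():
    print(f"{pattern:8} -> {result!r}")
```

The lazy forms ab?? and ab*? stop after the a alone, because zero repetitions of b already satisfy them.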

Greedy versus lazy

If one of the quantifiers *, +, ?, {n}, {n,}, or {n,m} is followed by an additional ?, the preceding quantifier repeats as few times as possible; this is called lazy matching. Conversely, using a quantifier on its own, without the trailing ?, is called greedy matching. It may seem complicated, but the names can be taken literally: from a given starting position, a lazy pattern matches the shortest string that satisfies the expression, and a greedy pattern matches the longest. Greedy mode and lazy mode go by different names in different tutorials and manuals, so it is enough to understand the meaning.

Test

  • Greedy mode: a.*b

You can see there is only one match, the whole string, because that is the longest match

  • Lazy mode: a.*?b

Here, as soon as a match is found, the current match is completed, and the next match starts from the following character, so there are four matches
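The difference can be reproduced with a minimal Python sketch. The sample string is an assumption, chosen so that the greedy pattern yields one match and the lazy pattern yields four.

```python
import re

# Hypothetical sample: four "ab" pairs separated by spaces.
text = "ab ab ab ab"

# Greedy: .* runs to the end, then backs off to the last b.
greedy = re.findall(r"a.*b", text)

# Lazy: .*? stops at the first b it can reach, then matching restarts.
lazy = re.findall(r"a.*?b", text)

print(greedy)  # ['ab ab ab ab']
print(lazy)    # ['ab', 'ab', 'ab', 'ab']
```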

Character operation

Alternation

Alternation uses the character |. If a string matches the rule on either the left or the right of |, the string matches the expression. | means "or", the same as the logical OR operator in code. Matching follows a left-first principle: alternatives are tried from left to right, and the expression on the right is tried only when the expression on the left is not satisfied.

Test

  • a|b

You can see that either character a or b matches this expression, which is equivalent to [ab].
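The left-first rule is easiest to see when one alternative is a prefix of the other. The following Python sketch uses invented sample strings to show both the equivalence with [ab] and the effect of the order of alternatives.

```python
import re

# a|b and [ab] find the same single characters.
with_pipe = re.findall(r"a|b", "cabbage")
with_class = re.findall(r"[ab]", "cabbage")

# Left-first: the left alternative wins as soon as it succeeds,
# even if the right alternative would have matched more text.
short_first = re.search(r"a|ab", "ab").group()
long_first = re.search(r"ab|a", "ab").group()

print(with_pipe, with_class)    # ['a', 'b', 'b', 'a'] twice
print(short_first, long_first)  # a ab
```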



Grouping

A group is also known as a subexpression: all or part of a regular expression is made into one or more groups. Groups are written with (); the expression inside the parentheses is a group, and a group is treated as a whole. Take care to distinguish () from []: [123] matches the character 1, 2, or 3, whereas (123) matches the string 123.
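A brief Python sketch of the difference between brackets and parentheses, using made-up inputs. It also shows that a quantifier placed after a group applies to the whole group.

```python
import re

# [123] matches one character at a time; (123) matches the whole string.
class_hits = re.findall(r"[123]", "123")
group_hits = re.findall(r"(123)", "123")

# A group is a unit: (ab)+ repeats "ab", while ab+ repeats only the b.
group_repeats = re.fullmatch(r"(ab)+", "ababab") is not None
char_repeats = re.fullmatch(r"ab+", "ababab") is not None

print(class_hits, group_hits)       # ['1', '2', '3'] ['123']
print(group_repeats, char_repeats)  # True False
```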

Backreferences

Group numbers: when a regular expression contains groups, each group is automatically assigned a number from left to right, by the position of its left parenthesis (, starting at 1. The first group is number 1, the second is number 2, and so on. Later in the expression, \ followed by the group number refers back to that earlier group. For example, in \b(\w)\1\b, the \1 is a reference to the preceding (\w) group.

Groups can also be given a custom name with (?<name>...). Both (?<word>\w) and (?'word'\w) save the letter matched by \w into a group named word. Custom named groups are referenced with \k<name>, as in \b(?<word>\w)\k<word>\b, which matches two-letter words made of the same letter twice.

Backreferencing provides an easy way to find repeated groups of characters. Think of it as a shortcut for matching the same text again. In \b(\w)\1\b, \1 stands for the text that (\w) actually matched: if (\w) matched a, then \1 must also be a, not any letter. By contrast, \b(\w)\w\b matches two-letter words whose letters are allowed to differ.
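The sketch below tries these patterns in Python on an invented word list. One syntax difference to be aware of: Python's re module spells named groups as (?P<name>...) and named backreferences as (?P=name), rather than the (?<name>...) and \k<name> forms shown above.

```python
import re

# Hypothetical sample words.
text = "aa ab bb abc cc"

# Numbered backreference: the second letter must equal the first.
same_letters = [m.group(0) for m in re.finditer(r"\b(\w)\1\b", text)]

# Named group and named backreference, in Python's spelling.
same_letters_named = [
    m.group(0) for m in re.finditer(r"\b(?P<word>\w)(?P=word)\b", text)
]

# No backreference: any two-letter word, letters may differ.
any_two_letters = [m.group(0) for m in re.finditer(r"\b(\w)\w\b", text)]

print(same_letters)        # ['aa', 'bb', 'cc']
print(same_letters_named)  # ['aa', 'bb', 'cc']
print(any_two_letters)     # ['aa', 'ab', 'bb', 'cc']
```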


Character             Description
(expression)          Matches expression and saves the matched text in an automatically numbered group
(?<name>expression)   Matches expression and saves the matched text in a group called name. The name cannot contain punctuation marks and cannot start with a digit
(?:expression)        Matches expression, but does not save the matched text and does not assign a group number to this group
(?=expression)        Matches the position just before expression
(?!expression)        Matches a position that is not followed by expression
(?<=expression)       Matches the position just after expression
(?<!expression)       Matches a position that is not preceded by expression
(?>expression)        Matches expression only once, without backtracking into it (an atomic group)

Test

  • (ab)

ab is treated as a whole here; the match is no different from that of the plain string ab

  • (?<word>ab)\k<word>

The group ab is named word, and \k<word> then refers to the text matched by that named group, so the expression is equivalent to (ab)ab

  • (?:a)(b)\1

The first group is non-capturing, so automatic numbering starts from (b), and the \1 that follows matches what (b) matched

  • b(?=a)

This expression matches a character b that is followed by an a

  • b(?!a)

This expression matches a character b that is not followed by an a

  • (?<=a)b

This expression matches a character b that is preceded by an a

  • (?<!a)b

This expression matches a character b that is not preceded by an a

  • (?>a)b
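The lookaround and non-capturing forms can be checked with a short Python sketch. The sample string and the helper positions are hypothetical. Match positions are reported instead of text, because every one of these patterns consumes only the single character b. The atomic group (?>...) is left out, since as far as I know Python only supports it from version 3.11.

```python
import re

# Hypothetical sample; b occurs at indexes 0, 3, 7, and 10.
text = "ba bc ab cb"

def positions(pattern: str) -> list[int]:
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

followed_by_a = positions(r"b(?=a)")       # b with an a after it
not_followed_by_a = positions(r"b(?!a)")   # b without an a after it
preceded_by_a = positions(r"(?<=a)b")      # b with an a before it
not_preceded_by_a = positions(r"(?<!a)b")  # b without an a before it

# (?:a) gets no group number, so (b) is group 1 and \1 repeats the b.
non_capturing = re.fullmatch(r"(?:a)(b)\1", "abb")
first_group = non_capturing.group(1) if non_capturing else None

print(followed_by_a, not_followed_by_a)
print(preceded_by_a, not_preceded_by_a)
print(first_group)
```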

References

  1. Wang Lei. Journey of Magical Matching Regular Expression Refinement [M]. Beijing: Publishing House of Electronics Industry, 2014.
  2. Regular expression test tool used in this article: Regular expression test tool, online debugging and sharing (Zjmainstay).
  3. Regulex: JavaScript Regular Expression Visualizer, used to visualize the regular expressions.
