A regular expression is a pattern that describes the structure of a string. It is often used to match a string (that is, text) to determine whether any part of the string has a given structure.

Regular expressions in JavaScript borrow from Perl. They are provided by the built-in RegExp object, which mainly records the matching rules of a regular expression.

1. Character classification

A regular expression literal is the section between two slashes. It consists of characters, some of which are literal and some of which are special.

/regular expression/igm

Note: i, g, and m are modifiers (flags) that attach extra rules. ES6 added the u and y modifiers, and ES2018 added the s modifier.

Characters that stand for themselves are called literal characters, and characters that have a special function are called metacharacters.

The metacharacters are:

. ^ $ | \ ? * + ( ) [ ] { }

(1) Dot character (.)

The dot character (.) represents any single character except carriage return (\r), line feed (\n), line separator (\u2028), paragraph separator (\u2029), and characters with code points greater than 0xFFFF.

Note: The u modifier introduced in ES6 can match characters with code points greater than 0xFFFF.

/a.c/.test('abc') // true
/a.c/.test('abbc') // false

Note that the dot stands for exactly one character, not several. An escape sequence such as \n is written with more than one symbol but still represents a single character.
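The dot rules above can be checked with a short sketch. The sample strings and variable names here are made up for illustration.

```javascript
// The dot matches exactly one character, but never a line terminator.
const dotPlain = /^a.c$/;
console.log(dotPlain.test('abc'));  // true
console.log(dotPlain.test('a\nc')); // false

// Without u, a character above 0xFFFF counts as two code units,
// so a single dot cannot cover it.
const camel = '\u{1F42A}';
console.log(/^.$/.test(camel));  // false
console.log(/^.$/u.test(camel)); // true
```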

(2) Position character (^ $)

The position character ^ represents the beginning of the string, while $ represents the end of the string.

// The string begins with abc
/^abc/.test('abcdef') // true

// The string ends with def
/def$/.test('abcdef') // true

// The whole string is exactly abc
/^abc$/.test('abc'); // true

(3) Alternation operator (|)

The alternation operator (|) means "or" in a regular expression. For example, /cat|dog/ matches cat or dog.

// The string contains cat or dog
/cat|dog/.test('cat') // true
/cat|dog/.test('dog') // true

Note that | applies to the whole sequence of characters on each side of it, not just to the single character next to it.
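A small sketch of how far | reaches. It uses parentheses (covered under group matching) to limit the alternation; the words are arbitrary examples.

```javascript
// | splits the whole pattern: this means "^gra" or "ey$", not "gray" or "grey".
const loose = /^gra|ey$/;
// Parentheses limit the alternation to the single letter.
const grouped = /^gr(a|e)y$/;

console.log(loose.test('grab'));   // true, because the string starts with "gra"
console.log(grouped.test('grab')); // false
console.log(grouped.test('grey')); // true
```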

(4) Escape character (\)

To match a metacharacter literally in a regular expression, escape it by putting \ in front of it. For example, to match a literal dot (.):

// The string contains a.c
/a\.c/.test('a.c') // true
/a\.c/.test('abc') // false

Note that if the regular expression is created with the RegExp constructor, the first argument is a string, and a string literal treats the backslash as its own escape character, so two backslashes (\\) are required.

var r = new RegExp('a\\.c'); // /a\.c/
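When the pattern text comes from a variable, every metacharacter in it needs the same treatment. The sketch below assumes a helper named escapeRegExp, which is hypothetical and written here only for illustration.

```javascript
// Hypothetical helper: put a backslash before every metacharacter.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const literalDot = new RegExp('a\\.c');            // same as /a\.c/
const fromInput = new RegExp(escapeRegExp('a.c')); // built from a plain string

console.log(literalDot.source);     // a\.c
console.log(fromInput.test('a.c')); // true
console.log(fromInput.test('abc')); // false
```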

(5) Character class ([]), negation character (^), hyphen (-)

A character class ([]) lists the characters that are allowed to match at one position. It works like a set of characters, written between [ and ].

// The string contains one of the characters a, b, or c
/[abc]/.test('a'); // true
/[abc]/.test('b'); // true
/[abc]/.test('c'); // true

Inside a character class, ^ is not a position character but a negation character: the class then matches any character that is not listed in it.

// The string contains a character other than a, b, or c
/[^abc]/.test('a'); // false
/[^abc]/.test('b'); // false
/[^abc]/.test('c'); // false

Note that ^ negates only when it is the first character in the class. [^] matches any character.

Inside a character class, - is also a special character, called a hyphen, and represents a contiguous range of characters.

// The string contains one of the lowercase letters a to z
/[a-z]/.test('b') // true

// The string contains one of the digits 0-9 or the letters a-f or A-F
/[0-9a-fA-F]/.test('1') // true

Note that the hyphen - only connects single characters, so [1-31] means the characters 1 to 3 (plus a redundant 1), not the numbers 1 to 31. Ranges also work with Unicode characters.

// The string contains 1, 2, or 3
/[1-31]/.test('4') // false

// The string contains a character in the range \u0128 to \uFFFF
/[\u0128-\uFFFF]/.test('\u0130\u0131\u0132') // true
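A brief sketch of the range rules, anchored with ^ and $ so that each test looks at a whole one-character string. The variable names are illustrative only.

```javascript
// Several ranges can share one class.
const hexDigit = /^[0-9a-fA-F]$/;
console.log(hexDigit.test('c')); // true
console.log(hexDigit.test('g')); // false

// A hyphen placed last in the class is a literal hyphen, not a range.
const digitOrHyphen = /^[0-9-]$/;
console.log(digitOrHyphen.test('-')); // true

// [1-31] still matches one character at a time, so "31" as a whole fails.
console.log(/^[1-31]$/.test('2'));  // true
console.log(/^[1-31]$/.test('31')); // false
```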

(6) Repetition class ({}), quantifiers (? * +), greedy mode

The repetition class ({}) specifies how many times the preceding character is repeated (that is, how often it occurs).

  • {n} : the character is repeated exactly n times.

  • {n,} : the character is repeated at least n times.

  • {n,m} : the character is repeated n to m times.

// The character o is repeated twice
/lo{2}/.test('look') // true
/lo{2}/.test('lok') // false

Quantifiers can also be used in regular expressions to indicate the number of occurrences of characters.

  • ? : 0 or 1 occurrence of a character, equivalent to {0,1}.

  • * : the character appears at least 0 times, equivalent to {0,}.

  • + : the character appears at least once, which is equivalent to {1,}.

// The character t occurs 0 or 1 times
/t?est/.test('test') // true
/t?est/.test('est') // true

By default, once a repetition count is specified, the regular expression matches as many characters as possible. This rule is called greedy mode.

// Greedy mode: the character a occurs 1 to 4 times, and as many as possible are returned
/a{1,4}/.exec('aaaabc')[0] // "aaaa"

To match as few characters as possible, switch to non-greedy mode by appending ? to the quantifier.

// Non-greedy mode: the character a occurs 1 to 4 times, and as few as possible are returned
/a{1,4}?/.exec('aaaabc')[0] // "a"
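The difference matters most when a delimiter appears more than once. A minimal sketch, with a made-up input string:

```javascript
const html = '<b>bold</b>';

// Greedy: .+ runs to the last > in the string.
console.log(html.match(/<.+>/)[0]);  // <b>bold</b>

// Non-greedy: .+? stops at the first >.
console.log(html.match(/<.+?>/)[0]); // <b>
```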

(7) Unprintable characters

Regular expressions provide expressions for special characters that cannot be printed:

  • \cX: Matches the control character Ctrl-X, where X is any of the letters A-Z.

  • [\b]: Matches a backspace (U+0008), not to be confused with \b.

  • \r: Matches a carriage return (U+000D).

  • \n: Matches a line feed (U+000A).

  • \t: Matches a tab (U+0009).

  • \v: Matches a vertical tab (U+000B).

  • \f: Matches a form feed (U+000C).

  • \0: Matches the null character (U+0000).

  • \xHH: Matches a character represented by a two-digit hexadecimal number (\x00-\xFF).

  • \uHHHH: Matches a Unicode character represented by a four-digit hexadecimal number (\u0000-\uFFFF).
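A few of these escapes in action, as a quick sketch with arbitrary sample strings:

```javascript
console.log(/\t/.test('a\tb'));          // true: tab
console.log(/\x41/.test('A'));           // true: \x41 is "A"
console.log(/\u00e9/.test('caf\u00e9')); // true: \u00e9 is "é"
console.log(/[\b]/.test('\b'));          // true: backspace, only inside []
console.log(/\cJ/.test('\n'));           // true: Ctrl-J is the line feed
```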

(8) Predefined characters

Predefined characters are shorthand for some common character matching patterns in regular expressions.

  • \d: Matches any number from 0 to 9, equivalent to [0-9].

  • \D: Matches characters other than 0 to 9, equivalent to [^0-9].

  • \w: Matches any letter, digit, or underscore, equivalent to [A-Za-z0-9_].

  • \W: Matches any character other than letters, digits, and underscores, equivalent to [^A-Za-z0-9_].

  • \s: Matches whitespace (spaces, tabs, newlines, and so on). For ASCII text it is equivalent to [ \r\n\t\v\f]; it also matches other Unicode whitespace.

  • \S: Matches any character that is not whitespace, the complement of \s.

  • \b: Matches a word boundary, that is, a position where a word stands on its own.

  • \B: Matches a non-word boundary, that is, a position inside a word.

// The word world has a boundary at the tested side
/\bworld/.test('hello world') // true
/\bworld/.test('world hello') // true
/\bworld/.test('hello-world') // true
/\bworld/.test('world-hello') // true
/world\b/.test('world hello') // true
/world\b/.test('hello world') // true
/world\b/.test('hello-world') // true
/world\b/.test('world-hello') // true

// The word world has no boundary at the tested side
/\Bworld/.test('hello world') // false
/\Bworld/.test('helloworld') // true
/world\B/.test('hello world') // false
/world\B/.test('worldhello') // true
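The shorthand classes combine naturally with quantifiers. A short sketch on a made-up line of text, using the g modifier so that match returns every hit:

```javascript
const line = 'order 42: 3 apples';

console.log(line.match(/\d+/g));     // ["42", "3"]
console.log(line.split(/\s+/));      // ["order", "42:", "3", "apples"]
console.log(line.match(/\b\w+\b/g)); // ["order", "42", "3", "apples"]
```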

(9) Modifier (i g m u y s)

A modifier is placed after the closing slash of the regular expression to attach extra rules:

  • i: Ignore case.

  • g: Global. After a match, the search continues from the remaining position. With the string match method, the result is an array of every full match, without the group contents.

  • m: Multiline. It affects only the position characters ^ and $, which then also match at the start and end of each line.

  • u: Unicode mode, for correctly matching characters with code points greater than \uFFFF.

  • y: Sticky. Like g, it continues after the previous match, but each match must begin exactly at the first remaining position.

  • s: Single line, also known as dotAll mode. The dot character (.) matches every character, including the newline character \n.

// i ignores case
/abc/.test('ABC') // false
/abc/i.test('ABC') // true

// g global match
'abbcbb'.match(/bb/) // ["bb", index: 1, input: "abbcbb", groups: undefined]
'abbcbb'.match(/bb/g) // ["bb","bb"]

// m allows multiple lines
/world$/.test('hello world\n') // false
/world$/m.test('hello world\n') // true

// u Unicode mode
'🐪' === '\uD83D\uDC2A' // true
/^\uD83D/u.test('\uD83D\uDC2A') // false
/^\uD83D/.test('\uD83D\uDC2A') // true

// y sticky: each match must begin at the first remaining position
var str = 'aaa_aa_a';

var reg1 = /a+/g;
reg1.exec(str) // ["aaa"]
reg1.exec(str) // ["aa"]

var reg2 = /a+/y;
reg2.exec(str) // ["aaa"]
reg2.exec(str) // null

// s single line, also known as dotAll mode: the dot character (.) matches every character
/a.c/.test('a\nc') // false
/a.c/s.test('a\nc') // true
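The "remaining position" used by g and y is stored in the lastIndex property of the RegExp object. The sketch below makes that visible, reusing the same sample string; the variable names are illustrative.

```javascript
const text = 'aaa_aa_a';

// With g, exec records where it stopped in lastIndex and resumes from there.
const globalRe = /a+/g;
globalRe.exec(text);
console.log(globalRe.lastIndex); // 3

// matchAll walks every match of a g-flagged regex without manual bookkeeping.
const found = [...text.matchAll(/a+/g)].map((m) => `${m[0]}@${m.index}`);
console.log(found); // ["aaa@0", "aa@4", "a@7"]

// With y, the match must begin exactly at lastIndex.
const stickyRe = /a+/y;
stickyRe.lastIndex = 3;               // points at "_"
console.log(stickyRe.exec(text));     // null
stickyRe.lastIndex = 4;               // points at the second run of a's
console.log(stickyRe.exec(text)[0]);  // aa
```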

2. Group matching

You can use parentheses () to group several characters in a regular expression so that they are matched as a group instead of as individual characters.

// The character p is repeated at least once
/group+/.test('groupp') // true

// The character sequence group is repeated at least once
/(group)+/.test('groupgroup') // true

(1) Capturing

When a regular expression contains groups and is not global, the string match method also captures the content matched by each group:

// In a non-global match, match returns an array: the first element is the whole match and the following elements are the group matches
'abc'.match(/(.)b(.)/) // ["abc","a","c"]

If you do not want a group to be captured by the match method, write it as (?:), which is called a non-capturing group. For example:

// With a non-capturing group, the content matched by that group is not captured
'abc'.match(/(?:.)b(.)/) // ["abc","c"]

Inside the regular expression, a group can also be referenced as \n (n is an integer >= 1), which matches the same text that group n matched:

// The characters before and after b are repeated
/(.)b(.)\1b\2/.test("abcabc") // true

// Parentheses can also be nested, with \1 referring to the outer parentheses and \2 to the inner parentheses
/y((..)\2)\1/.test('yabababab') // true
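Two small sketches of these ideas, with invented sample strings: a backreference that spots an accidentally doubled word, and a non-capturing group that keeps the result array short.

```javascript
// \1 must repeat exactly what the first group captured.
const doubled = /\b(\w+) \1\b/;
console.log(doubled.exec('this is is a test')[1]); // is
console.log(doubled.test('this is a test'));       // false

// (?:...) groups without capturing, so only the month is captured.
const m = '2024-05'.match(/(?:\d{4})-(\d{2})/);
console.log(m[1]);     // 05
console.log(m.length); // 2
```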

(2) Assertions

An assertion specifies what must or must not come before or after a piece of text. There are four main types of assertions:

  • x(?=y): lookahead assertion (x matches only if it is followed by y)
// b is followed by c
/ab(?=c)/.test('abc') // true
/ab(?=c)/.test('ab') // false
/ab(?=c)/.test('abd') // false
  • x(?!y): negative lookahead assertion (x matches only if it is not followed by y)
// b is not followed by c
/ab(?!c)/.test('abc') // false
/ab(?!c)/.test('ab') // true
/ab(?!c)/.test('abd') // true
  • (?<=y)x: lookbehind assertion (x matches only if it is preceded by y)
// b is preceded by a
/(?<=a)bc/.test('abc') // true
/(?<=a)bc/.test('bc') // false
/(?<=a)bc/.test('dbc') // false
  • (?<!y)x: negative lookbehind assertion (x matches only if it is not preceded by y)
// b is not preceded by a
/(?<!a)bc/.test('abc') // false
/(?<!a)bc/.test('bc') // true
/(?<!a)bc/.test('dbc') // true

Note that a lookahead checks what follows, and a lookbehind checks what precedes. If the names feel backwards, take the assertion group as the reference point: in a lookahead, the content to match comes ahead of the group, and in a lookbehind it comes behind the group.
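Assertions are handy when the surrounding text decides whether a match counts but should not be part of it. A sketch with made-up inputs:

```javascript
// Lookahead: insert a comma wherever a multiple of three digits remains up to the end.
const withCommas = '1234567'.replace(/\B(?=(\d{3})+(?!\d))/g, ',');
console.log(withCommas); // 1,234,567

// Lookbehind: take digits only when a dollar sign comes right before them.
const receipt = 'cost: $42, qty: 17';
console.log(receipt.match(/(?<=\$)\d+/g)); // ["42"]

// Negative lookbehind: take numbers that are not preceded by $ or another digit.
console.log(receipt.match(/(?<![\d$])\d+/g)); // ["17"]
```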

(3) Named groups

ES2018 introduced named capture groups, which make match results easier to read.

Add ?<group name> at the start of the group, as in:

/(?<number>\d+)/.exec('ab12c3');
// ["12", "12", index: 2, input: "ab12c3", groups: {number: "12"}]

As you can see, the groups object uses each group name as a key and the text that group matched as the value. If the regular expression has no named group, groups is undefined.
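Because groups is a plain object, it works well with destructuring. A minimal sketch; the pattern and the date string are examples only.

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

// Pull the named captures straight out of the groups object.
const { year, month, day } = isoDate.exec('2015-01-02').groups;
console.log(year, month, day); // 2015 01 02

// Without any named group, groups is undefined.
console.log(/(\d+)/.exec('ab12').groups); // undefined
```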

Once you have named groups, the string replace method can use $<group name> to refer to what a group matched:

let re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
'2015-01-02'.replace(re, '$<day>/$<month>/$<year>') // "02/01/2015"

To refer back to a named group inside the regular expression itself, use \k<group name>. The numeric form \n (n is an integer >= 1) still works as well:

/^(?<word>[a-z]+)!\k<word>$/.test('abc!abc') // true
/^(?<word>[a-z]+)!\1$/.test('abc!abc') // true

(End)