Regular expressions (regex or regexp) are useful for extracting information from any text by searching for one or more matches of a specific pattern (that is, a specific sequence of ASCII or Unicode characters).

Applications include parsing, string replacement, data format conversion, and web scraping.

The interesting thing is that once you have learned the syntax, you can use this tool in almost any programming language (JavaScript, Java, VB, C#, C/C++, Python, Perl, Ruby, Delphi, R, Tcl, and many others), with only minor differences.

Let’s take a look at some examples and explanations.

Basic syntax

Anchors – ^ and $

  • ^The matches any string that starts with The

  • end$ matches any string that ends with end

  • ^The end$ matches the exact string The end (it must both start and end with it)

  • roar matches any string that contains roar
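The anchor patterns above can be checked quickly with RegExp.prototype.test. This is only a small sketch; the sample strings and variable names are made up for illustration.

```javascript
// Sample strings are made up for illustration.
const startsWithThe = /^The/;
const endsWithEnd = /end$/;
const exactlyTheEnd = /^The end$/;
const containsRoar = /roar/;

console.log(startsWithThe.test('The lion sleeps'));   // true
console.log(startsWithThe.test('See The lion'));      // false: "The" is not at the start
console.log(endsWithEnd.test('this is the end'));     // true
console.log(exactlyTheEnd.test('The end'));           // true
console.log(exactlyTheEnd.test('The end is near'));   // false: extra text after "end"
console.log(containsRoar.test('an uproar'));          // true: "roar" may appear anywhere
```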

Quantifiers – * + ? and {}

  • abc* matches a string that has ab followed by zero or more c [0, +∞]

  • abc+ matches a string that has ab followed by one or more c [1, +∞]

  • abc? matches a string that has ab followed by zero or one c [0, 1]

  • abc{2} matches a string that has ab followed by exactly 2 c

  • abc{2,} matches a string that has ab followed by 2 or more c

  • abc{2,5} matches a string that has ab followed by 2 to 5 c

  • a(bc)* matches a string that has a followed by zero or more copies of the sequence bc

  • a(bc){2,5} matches a string that has a followed by 2 to 5 copies of the sequence bc
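To see the quantifiers side by side, the sketch below runs each pattern against the same sample strings. The patterns are wrapped in ^ and $ (my addition) so that the whole string has to fit; the sample list and the accepted object are assumptions for illustration.

```javascript
// Samples contain "ab" followed by 0, 1, 2, 5 and 6 "c" characters.
const samples = ['ab', 'abc', 'abcc', 'abccccc', 'abcccccc'];

// ^ and $ are added so that the whole sample must fit the pattern.
const patterns = {
  'abc*': /^abc*$/,
  'abc+': /^abc+$/,
  'abc?': /^abc?$/,
  'abc{2}': /^abc{2}$/,
  'abc{2,}': /^abc{2,}$/,
  'abc{2,5}': /^abc{2,5}$/,
};

// For each pattern, collect the samples it accepts.
const accepted = {};
for (const [name, re] of Object.entries(patterns)) {
  accepted[name] = samples.filter((s) => re.test(s));
  console.log(name, '->', accepted[name].join(', '));
}

// Quantifiers applied to a group repeat the whole sequence "bc".
console.log(/^a(bc)*$/.test('abcbc'));    // true
console.log(/^a(bc){2,5}$/.test('abc'));  // false: only one "bc"
```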

OR operator – | or []

  • a(b|c) matches a string that has a followed by b or c

  • a[bc] same as above

Character classes – \d \w \s and .

  • \d matches a single digit

  • \w matches a word character (a letter, a digit, or an underscore)

  • \s matches a whitespace character (including tabs and the newline \n)

  • . matches any character except line terminators such as the newline \n

  • [\s\S] matches any character, line terminators included

Use the . metacharacter with caution, because character classes and negated character classes are faster and more precise.

\d, \w and \s also have negated forms: \D, \W and \S respectively.

For example, \D performs the inverse match of \d.

  • \D matches a single non-digit character

To be matched literally, the characters ^ . [ $ ( ) | * + ? { \ must be escaped with a backslash \, because they have special meaning.

  • \$\d matches a string that has a $ followed by one digit

Note that you can also match non-printable characters such as tabs \t, newlines \n, and carriage returns \r.
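The character classes and the escaping rule can be tried out on a short string. The sample text and the variable names below are assumptions for illustration.

```javascript
// Sample text is made up; it contains a tab and a trailing newline.
const text = 'Room 42\tcosts $7 per night.\n';

const digits = text.match(/\d/g);      // every single digit
const spaces = text.match(/\s/g);      // spaces, the tab and the newline
const price = text.match(/\$\d/)[0];   // an escaped, literal "$" followed by a digit
const leading = text.match(/\D+/)[0];  // first run of non-digit characters

console.log(digits);         // [ '4', '2', '7' ]
console.log(spaces.length);  // 6
console.log(price);          // $7
console.log(leading);        // "Room " (with the trailing space)

// The dot skips line terminators, while [\s\S] matches them too.
console.log(/^.$/.test('\n'));       // false
console.log(/^[\s\S]$/.test('\n'));  // true
```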

Flags (modifiers)

We are learning how to write a regular expression, but we have skipped one basic concept: flags (modifiers).

A regular expression usually has the form /abc/, where the pattern is delimited by two slashes. After the closing slash we can specify flags with the following values (they can also be combined).

  • g (global): does not stop after the first match; the search resumes from the end of the previous match until all matches have been returned.
  • m (multi-line): when enabled, ^ and $ match the start and end of each line instead of the whole string.
  • i (insensitive): makes the whole expression case-insensitive (for instance, /aBc/i would match AbC).
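A short sketch of how the three flags change the result of String.prototype.match. The two-line sample string and the variable names are assumptions for illustration.

```javascript
// A two-line sample string (made up for illustration).
const poem = 'Roses are red\nroses are blue';

const first = poem.match(/roses/);               // no flags: first case-sensitive match only
const allCaseInsensitive = poem.match(/roses/gi); // g + i: every match, any case
const stringStartOnly = poem.match(/^roses/gi);   // without m, ^ is the start of the string
const everyLineStart = poem.match(/^roses/gim);   // with m, ^ is the start of each line

console.log(first[0], first.index);  // roses 14
console.log(allCaseInsensitive);     // [ 'Roses', 'roses' ]
console.log(stringStartOnly);        // [ 'Roses' ]
console.log(everyLineStart);         // [ 'Roses', 'roses' ]
```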

New in ES6

  • y (sticky): like g, the y flag makes each search start where the previous match ended (at lastIndex). The difference is that g succeeds as long as a match exists anywhere in the remaining text, whereas y requires the match to begin exactly at the first remaining position. That is what “sticky” means.

    var s = 'aaa_aa_a';
    
    var r1 = /a+/g;
    var r2 = /a+/y;
    
    r1.exec(s); // ['aaa']
    r2.exec(s); // ['aaa']
    
    r1.exec(s); // ['aa']
    r2.exec(s); // null
    
    var r = /a+_/y;
    
    // try again
    r.exec(s); // ['aaa_']
    r.exec(s); // ['aa_']
    
    r.sticky // true
  • u (Unicode): “Unicode” mode, which correctly handles Unicode characters above \uFFFF, that is, characters that UTF-16 encodes as four bytes (a surrogate pair).

    var s = '𠮷';

    /^.$/.test(s); // false
    /^.$/u.test(s); // true
  • s (dotAll): by default the dot (.) matches any character except line terminators (\n, \r, the line separator U+2028, and the paragraph separator U+2029). With the s flag the dot matches every character, line terminators included; this is called dotAll mode. Note that this flag was added in ES2018, later than ES6.

    var s = '𠮷';

    /.*/s.test(s); // true
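The effect of the s flag is easier to see on a string that actually contains a line terminator, and the effect of u on what a single dot consumes. A minimal sketch, with sample values and variable names chosen for illustration:

```javascript
// The s flag: the dot also matches the newline between the two words.
const twoLines = 'first\nsecond';
const withoutDotAll = /first.second/.test(twoLines);
const withDotAll = /first.second/s.test(twoLines);
console.log(withoutDotAll);  // false
console.log(withDotAll);     // true

// The u flag: the dot consumes a whole code point instead of half a surrogate pair.
const han = '𠮷';
const plainDot = han.match(/./)[0];
const unicodeDot = han.match(/./u)[0];
console.log(han.length);         // 2 (two UTF-16 code units)
console.log(plainDot.length);    // 1 (a lone surrogate)
console.log(unicodeDot.length);  // 2 (the full character)
```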

Intermediate syntax

Grouping and capturing – ()

  • a(bc) the parentheses create a capturing group with value bc

  • a(?:bc)* using ?: disables capturing (a non-capturing group)

To understand ?: you need the concepts of capturing and non-capturing groups:

() denotes a capturing group. The text matched by each group is saved and can be referenced as $n, where n is the number of the capturing group.

(?:) denotes a non-capturing group. The only difference is that the text it matches is not saved.
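The difference shows up in the array returned by match, and $n becomes useful in a replacement string. A small sketch; the sample strings and variable names are made up for illustration.

```javascript
// Same pattern with a capturing and a non-capturing group.
const capturing = 'abcbc'.match(/a(bc)*/);
const nonCapturing = 'abcbc'.match(/a(?:bc)*/);

console.log(capturing[0], capturing[1]);  // abcbc bc  (group 1 holds its last repetition)
console.log(capturing.length);            // 2: the full match plus one saved group
console.log(nonCapturing.length);         // 1: only the full match is saved

// $1 and $2 refer to the first and second capturing group in a replacement.
const swapped = 'John Smith'.replace(/(\w+) (\w+)/, '$2, $1');
console.log(swapped);                     // Smith, John
```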

New in ES2018

  • a(?<foo>bc) using ?<foo> names the capturing group foo.

If you name a capturing group (using ?<name>), you can look up its value in the groups property of the match result, keyed by the group name.

This is very useful when extracting information from strings or data. When a pattern has several capturing groups, their values can be accessed by index in the match result ($n in a replacement string) or by name through groups.

var string = '1999-12-31';
const matchObj = string.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
// ["1999-12-31", "1999", "12", "31", index: 0, input: "1999-12-31", groups: {year: "1999", month: "12", day: "31"}]

// Named groups can be referenced as $<name> in the replacement string
const newStr = string.replace(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/, '$<day>/$<month>/$<year>');
// 31/12/1999

// The same replacement using numbered groups
const newStr2 = string.replace(/(\d{4})-(\d{2})-(\d{2})/, '$3/$2/$1');
// 31/12/1999

// Inside the regular expression itself, a named group is referenced with \k<name>
// (the group name "word" is arbitrary)
const RE_TWICE = /^(?<word>[a-z]+)!\k<word>$/;
RE_TWICE.test('abc!abc'); // true
RE_TWICE.test('abc!ab'); // false

Bracket expressions – []

  • [abc] matches a string that contains an a, a b, or a c; equivalent to a|b|c

  • [a-c] same as above

  • [a-fA-F0-9] matches a single hexadecimal digit, in either upper or lower case

  • [0-9]% matches a string that has a digit from 0 to 9 immediately before a % sign

  • [^a-zA-Z] matches a character that is not a letter from a to z or from A to Z. Here ^ negates the expression.

Note that most special characters (such as . * + ? $ and parentheses) lose their special meaning inside a bracket expression, so they need no escaping there. In JavaScript the backslash \ still works as an escape, and ] \ ^ (at the start) and - (between two characters) must be escaped to be matched literally.
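A few bracket expressions in action. The helper name isHexColor and the sample strings are hypothetical, chosen only to illustrate the patterns listed above.

```javascript
// Hypothetical helper: a "#" followed by exactly six hexadecimal digits.
const isHexColor = (s) => /^#[a-fA-F0-9]{6}$/.test(s);
console.log(isHexColor('#1a2B3c'));  // true
console.log(isHexColor('#12345g'));  // false: "g" is not a hexadecimal digit

// A negated class removes everything that is not a letter.
const lettersOnly = 'R2-D2'.replace(/[^a-zA-Z]/g, '');
console.log(lettersOnly);            // RD

// A digit immediately before a percent sign.
const percent = '50% off'.match(/[0-9]%/)[0];
console.log(percent);                // 0%

// Inside brackets, "." and "*" are literal and need no escaping...
const literals = 'a.b*c'.match(/[.*]/g);
console.log(literals);               // [ '.', '*' ]

// ...but "]" and "\" still have to be escaped in JavaScript.
const escaped = 'a]b\\c'.match(/[\]\\]/g);
console.log(escaped);                // [ ']', '\\' ]
```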

Greedy and lazy matching

The quantifiers (* + {}) are greedy: they extend the match as far as they can through the provided text.

For example, matching <.+> against <div>simple div</div> returns the entire text <div>simple div</div>.

To match only the opening div tag, add ? to make the quantifier lazy:

  • <.+?> matches one or more characters between < and >, expanding only as needed

A better solution avoids . altogether in favor of a stricter pattern:

  • <[^<>]+> matches one or more characters other than < and > between < and >
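The three patterns can be compared on the same input. A minimal sketch, assuming the sample markup from the example above; the variable names are mine.

```javascript
const html = '<div>simple div</div>';

const greedy = html.match(/<.+>/)[0];     // expands as far as possible
const lazy = html.match(/<.+?>/)[0];      // stops at the first ">"
const strict = html.match(/<[^<>]+>/g);   // cannot run past "<" or ">" at all

console.log(greedy);  // <div>simple div</div>
console.log(lazy);    // <div>
console.log(strict);  // [ '<div>', '</div>' ]
```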

Advanced syntax

Boundaries – \b and \B

  • \babc\b performs a “whole words only” search: it matches abc only when there are no word characters directly before or after it

\b is an anchor, like ^ and $, that matches a position where one side is a word character (as matched by \w) and the other side is not (for instance, the start of the string or a space character), e.g. \b123.

It also has a negation, \B, which matches every position where \b does not. It is useful for finding a pattern that is completely surrounded by word characters.

  • \Babc\B matches abc only when it is surrounded by word characters on both sides
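A quick check of both boundary anchors; the sample strings and the wholeWords variable are made up for illustration.

```javascript
// \b: abc must stand alone as a word.
console.log(/\babc\b/.test('abc def'));  // true
console.log(/\babc\b/.test('xabcx'));    // false: word characters on both sides

// \B: abc must be embedded inside a longer word.
console.log(/\Babc\B/.test('xabcx'));    // true
console.log(/\Babc\B/.test('abc'));      // false: boundaries on both sides

// Whole-word search: "concat" and "cats" are skipped.
const wholeWords = 'cat concat cats'.match(/\bcat\b/g);
console.log(wholeWords);                 // [ 'cat' ]
```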

Back-references – \1

  • ([abc])\1 using \1 matches the same text that the first capturing group matched, so it matches aa, bb or cc but not ab

  • ([abc])([de])\2\1 using \2 and \1 matches the same text that the second and first capturing groups matched, and so on for \3, \4, etc.; for example, it matches adda

  • (?<foo>[abc])\k<foo> names the group foo and references it later with \k<foo>; the result is the same as the first regex, ([abc])\1
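A back-reference repeats the text a group matched, not the group's pattern, which the sketch below tries to make visible. The sample strings and variable names are assumptions for illustration.

```javascript
const doubled = /([abc])\1/;
console.log(doubled.test('aa'));  // true
console.log(doubled.test('ab'));  // false: \1 must repeat the "a" that was captured

console.log(/([abc])([de])\2\1/.test('adda'));    // true
console.log(/(?<foo>[abc])\k<foo>/.test('cc'));   // true

// A typical use: spotting an accidentally repeated word.
const repeated = 'this is is a test'.match(/\b(\w+) \1\b/);
console.log(repeated[0]);  // is is
console.log(repeated[1]);  // is
```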

Look-ahead and look-behind – (?=) and (?<=)

At the time of writing, Firefox did not support look-behind (I ran into this once), so check browser compatibility before relying on it.

  • d(?=r) matches a d only if it is followed by r, but r is not part of the overall match

  • (?<=r)d matches a d only if it is preceded by r, but r is not part of the overall match

You can also use the negation operator.

  • d(?!r) matches a d only if it is not followed by r, but r is not part of the overall match

  • (?<!r)d matches a d only if it is not preceded by r, but r is not part of the overall match
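The four assertions can be told apart by where the match lands. A minimal sketch, assuming an engine with look-behind support; the sample strings and variable names are made up.

```javascript
// Positions in 'dr dx': d=0, r=1, space=2, d=3, x=4
const followedByR = 'dr dx'.match(/d(?=r)/);
const notFollowedByR = 'dr dx'.match(/d(?!r)/);
console.log(followedByR[0], followedByR.index);        // d 0  (the r is not part of the match)
console.log(notFollowedByR[0], notFollowedByR.index);  // d 3

// Positions in 'rd xd': r=0, d=1, space=2, x=3, d=4
const precededByR = 'rd xd'.match(/(?<=r)d/);
const notPrecededByR = 'rd xd'.match(/(?<!r)d/);
console.log(precededByR.index);     // 1
console.log(notPrecededByR.index);  // 4

// Typical use: grab the number after a "$" without capturing the "$" itself.
const amount = 'Price: $42, qty 7'.match(/(?<=\$)\d+/)[0];
console.log(amount);                // 42
```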

Usage

Note: regexp is an instance of RegExp, and str is an instance of String

Method | Description | Return value
regexp.test(str) | Tests whether str contains a match | true if a match is found, otherwise false
regexp.exec(str) | Matches regexp against str | An array describing the match, or null if there is none. Unlike match with the g flag, it always includes the capturing groups and the match index
str.match(regexp) | Matches regexp against str | An array of match results, or null if there is none
str.replace(pattern, replacement) | Finds matches of pattern (a regexp or a string) in str and replaces them with replacement (newSubStr or the return value of function) | The new string with the replacements applied
str.search(regexp) | Matches regexp against str | The index of the first match, or -1 if there is none
str.split(regexp) | Splits str into an array, using regexp as the separator | The resulting array
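Each of the methods above applied to one small input. The sample string and the variable names are assumptions for illustration.

```javascript
const re = /(\d+)-(\d+)/;
const str = 'range 10-20 and 30-40';

console.log(re.test(str));  // true

// exec returns the match, its capturing groups and the match index.
const m = re.exec(str);
console.log(m[0], m[1], m[2], m.index);  // 10-20 10 20 6

// match with the g flag returns every match, without groups.
const all = str.match(/\d+-\d+/g);
console.log(all);  // [ '10-20', '30-40' ]

// replace without the g flag only touches the first match.
const replaced = str.replace(re, '$2-$1');
console.log(replaced);  // range 20-10 and 30-40

console.log(str.search(re));      // 6
console.log(str.search(/xyz/));   // -1

const parts = 'a1b22c'.split(/\d+/);
console.log(parts);  // [ 'a', 'b', 'c' ]
```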

Caveats for test() and exec()

If the regular expression has the global flag g set, each successful call to test() advances the regex’s lastIndex property, and the next call to test() starts searching at lastIndex (exec() updates lastIndex in the same way). When a call finds no match, lastIndex is reset to 0.

The following example shows this behavior:

const digits = /\d+/g;

digits.test("Hello world! 123"); // true
digits.test("321"); // false
digits.test("321"); // true

You can work around this by resetting lastIndex before each call:

const digits = /\d+/g;

digits.test("Hello world! 123"); // true

digits.lastIndex = 0;
digits.test("321"); // true

digits.lastIndex = 0;
digits.test("321"); // true

For details, please refer to MDN

The replace method

  • Syntax
str.replace(regexp|substr, newSubStr|function)
  • Parameters

    • regexp (pattern)

      A RegExp object or literal. The text it matches is replaced by the second argument (or by its return value, if that argument is a function).

    • substr (pattern)

      A string to be replaced by newSubStr. It is treated as a literal string, not as a regular expression. Only the first occurrence is replaced.

    • newSubStr (replacement)

      The string that replaces the part of the original string matched by the first argument.

    • function(a, b, c, d) (replacement)

      A function called to produce the new substring; its return value replaces the text matched by the first argument.

      • a: the matched text

      • b: the text matched by a capturing group

        If the pattern has no capturing groups, this parameter is absent. If it has several, there is one parameter per group (b, c, d, e, …), and the index and original string follow them.

        If a capturing group is repeated, its parameter holds the last repetition matched. For example: (\d)+

      • c: the index of the match in the original string

      • d: the original string

        After the capturing-group parameters, the next two arguments are always the match index and the original string.

  • If replace is still unclear, take a look at the examples in the appendix.
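To make the callback parameters concrete, here is a small sketch. The snake_case conversion and the names snake, camel, calls and seen are hypothetical examples, not part of the original article.

```javascript
// Convert snake_case to camelCase with a replacement function.
const snake = 'background_color_alpha';
const calls = [];
const camel = snake.replace(/_([a-z])/g, (match, letter, offset, whole) => {
  // match: the matched text, letter: capturing group 1,
  // offset: index of the match, whole: the original string
  calls.push([match, letter, offset]);
  return letter.toUpperCase();
});
console.log(camel);  // backgroundColorAlpha
console.log(calls);  // [ [ '_c', 'c', 10 ], [ '_a', 'a', 16 ] ]

// A repeated group only reports its last repetition.
const seen = [];
'2024'.replace(/(\d)+/, (match, lastDigit, offset) => {
  seen.push(match, lastDigit, offset);
  return match;
});
console.log(seen);   // [ '2024', '4', 0 ]
```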

Conclusion

As you can see, regular expressions have a wide range of uses, and I’m sure you have come across at least one of these tasks in your development career. Here is a list of applications:

  • Data validation (for example, checking that a time string is well formed)
  • Data scraping (especially web scraping: finding all pages that contain a certain set of words, possibly in a specific order)
  • Data wrangling (converting data from a “raw” format to another format)
  • String parsing (for example, capturing all URL GET parameters, or capturing the text inside a set of parentheses)
  • String replacement (for example, even during a coding session in a common IDE, turning a Java or C# class into the corresponding JSON object: replace “;” with “,”, make it lowercase, drop type declarations, etc.)
  • Syntax highlighting, file renaming, packet sniffing, and many other applications involving strings (where the data need not be textual)

Have fun and do not forget to recommend the article if you liked it 💚

Appendix: replace examples

  1. Fill in the two blanks below:
// define
(function (window) {
    function fn(str) {
        this.str = str;
    }

    fn.prototype.format = function () {
        var arg = ____;

        return this.str.replace(____, function (a, b) {
            return arg[b] || '';
        });
    };

    window.fn = fn;
})(window);

// use
(function () {
    var t = new fn('<p><a href="{0}">{1}</a><span>{2}</span></p>');
    console.log(t.format('http://www.yonyou.com', 'yonyou', 'Welcome'));
})();
// If you understand how replace works, this one is easy.
  2. Convert the integer 87654321 to the currency string $87,654,321 using a regex
'87654321'.replace(/(\d)+?(?=(\d{3})+(?!\d))/g, function (a, b, c, d) {
  // a: the match, b and c: the capturing groups, d: the index of the match
  return d < 2 ? ("$" + a + ",") : (a + ",");
});
// $87,654,321
// For details, read up on replace and regular expressions

'87654321'.replace(/(\d{1,3})(?=(\d{3})+$)/g, function (a, b, c, d) {
  return d < 2 ? ("$" + a + ",") : (a + ",");
});
// $87,654,321

'87654321'.replace(/\d{1,3}(?=(\d{3})+$)/g, '$&,'); // $& is the matched text; $1, $2, ... are the capturing groups
// 87,654,321
  3. A password regex: at least six characters, including at least one uppercase letter, one lowercase letter, and one digit
/(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])^[0-9A-Za-z]{6,}$/.test('w44Y4S') // true
// The three leading look-aheads are constraints checked at the position of the ^ that follows them

/^.*(?=.{6,})(?=.*\d)(?=.*[A-Z])(?=.*[a-z])/.test('w44sYw') // true
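For the first exercise, one possible way to fill in the two blanks is sketched below. It is not necessarily the answer the author had in mind; the constructor is renamed to Template (a made-up name) and the window wrapper is dropped so that the snippet runs on its own.

```javascript
// A possible answer to exercise 1, as a standalone sketch.
function Template(str) {
  this.str = str;
}

Template.prototype.format = function () {
  // First blank: the arguments passed to format(), indexed 0, 1, 2, ...
  var arg = arguments;

  // Second blank: match placeholders such as {0} and capture the number.
  return this.str.replace(/\{(\d+)\}/g, function (a, b) {
    // a is the whole placeholder, b is the captured index
    return arg[b] || '';
  });
};

var t = new Template('<p><a href="{0}">{1}</a><span>{2}</span></p>');
var html = t.format('http://www.yonyou.com', 'yonyou', 'Welcome');
console.log(html);
// <p><a href="http://www.yonyou.com">yonyou</a><span>Welcome</span></p>
```

A placeholder whose index has no matching argument is replaced by an empty string, because arg[b] is then undefined.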

References

  1. Regex tutorial — A quick cheatsheet by examples
  2. Understanding ?=, ?: and ?! in regular expressions
  3. [JS advanced] test, exec, match, replace
  4. Introduction to ES6 standard (3rd edition) — Ruan Yifeng
  5. A regular expression surprise in JavaScript