A look at the u modifier in ES6 regular expressions

ES6 added the u modifier to regular expressions, which makes JavaScript regexes more powerful and more accurate.

You can use the RegExp.prototype.unicode property to determine whether a regex uses the u modifier.
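
As a small illustration of that property, the sketch below compares two regex literals. The variable names are made up for this example; only the unicode and flags properties come from the language itself.

```javascript
// Two otherwise identical patterns, with and without the u modifier.
const withU = /a/u;
const withoutU = /a/;

console.log(withU.unicode);    // true
console.log(withoutU.unicode); // false

// The flags property lists every modifier in use, in a fixed order.
const allFlags = /a/giu.flags;
console.log(allFlags);         // "giu"
```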

Let's look at what the u modifier does in the following scenarios:

  1. Match any character

    1. The traditional way is to use [\d\D], [\s\S], [\w\W], or [^] to match any character:
      /[\d\D]/.test("\r");
      //true
    2. Of course, you can also spell out what . means, i.e. [^\n\r\u2028\u2029], and then use .|[\n\r\u2028\u2029] to match any character:
      /^(.|[\n\r\u2028\u2029])*$/.test("1a\t\n\n\r\f-_(");
      //true
    3. Another approach is the s modifier, which was added later, in ES2018. It makes the special character . match line terminators too, so that . matches any single Unicode Basic Multilingual Plane (BMP) character:
      "\n".match(/./);
      //null
      "\n".match(/./s);
      //["\n", index: 0, input: "\n", groups: undefined]

      Whether the s modifier is in use can be checked through the RegExp.prototype.dotAll property:

      /./.dotAll;
      //false
      /./s.dotAll;
      //true

    However, this is not the end of the story. When you try to match the following example with the regexes above, you will find that the character is not recognized properly:

    "𠮷".match(/[\d\D]/);
    //["\uD842", index: 0, input: "𠮷", groups: undefined] (only the high surrogate is matched)

    To correctly identify Unicode characters like this, whose code points are greater than 0xFFFF, the u modifier comes in handy. It correctly handles characters that take four bytes (a surrogate pair) in UTF-16:

    "𠮷".match(/[\d\D]/u);
    //["𠮷", index: 0, input: "𠮷", groups: undefined]
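
The s and u modifiers can be combined so that . covers both line terminators and characters above 0xFFFF. The following is a minimal sketch, assuming a runtime with ES2018 support; the variable names are illustrative only.

```javascript
const astral = "𠮷"; // U+20BB7, stored as two UTF-16 code units
console.log(astral.length); // 2

// With s alone, . still consumes a single code unit, so one dot cannot span the character.
const dotAllOnly = /^.$/s.test(astral);
console.log(dotAllOnly); // false

// With s and u together, . consumes the whole code point.
const dotAllUnicode = /^.$/su.test(astral);
console.log(dotAllUnicode); // true

// Line terminators are still covered thanks to s.
const newlineMatch = /^.$/su.test("\n");
console.log(newlineMatch); // true
```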
  2. Match any position

    We know that \b matches the boundary between a word and a non-word character, and \B matches every position that is not a word boundary. Those who know the u modifier may be tempted to write /[\b\B]/ug. Although it looks reasonable, it is still not the correct solution.

    The correct pattern is /\b|\B/ug. Besides using the u modifier, you also need to know what /[\b]/ means: inside a character class, \b is the escape for the backspace character \u0008, not a word boundary.

    In addition, for the word boundary \b, both JavaScript and Java regexes treat a word as a run of \w, i.e. [a-zA-Z0-9_]. In .NET, a word is defined as a run of [a-zA-Z0-9_] plus Unicode characters (Chinese characters, full-width characters, etc.).
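
One way to see the difference is to insert a marker at every matched position. This is only a sketch: the helper names mark and compiles are hypothetical and exist just for this example.

```javascript
// Hypothetical helper: put "|" at every position the regex matches.
const mark = (str, re) => str.replace(re, "|");

// Every position in an ASCII string is either \b or \B.
const ascii = mark("ab", /\b|\B/g);
console.log(ascii); // "|a|b|"

// Without u, a position inside the surrogate pair is matched as well.
const withoutU = mark("𠮷", /\b|\B/g);
console.log(withoutU.length); // 5: two code units plus three markers

// With u, only the positions before and after the character are matched.
const withU = mark("𠮷", /\b|\B/ug);
console.log(withU); // "|𠮷|"

// Inside a character class, \b is the backspace character.
const backspace = /[\b]/.test("\u0008");
console.log(backspace); // true

// Hypothetical helper: does the pattern compile with the given flags?
const compiles = (source, flags) => {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
};

// \B is not a valid escape inside a character class when u is present.
const classCompiles = compiles("[\\b\\B]", "u");
console.log(classCompiles); // false
```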

  3. Correct identification of quantifiers

    With the u modifier, all quantifiers correctly apply to Unicode characters with code points greater than 0xFFFF.
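
A short sketch of the quantifier behavior, using the same example character as earlier in the article (variable names are illustrative):

```javascript
const twice = "𠮷𠮷";

// Without u, {2} applies only to the trailing surrogate of the pair, so the match fails.
const withoutU = /^𠮷{2}$/.test(twice);
console.log(withoutU); // false

// With u, {2} applies to the whole character.
const withU = /^𠮷{2}$/u.test(twice);
console.log(withU); // true
```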

  4. Correct recognition of predefined patterns

    In regexes, \d, \w, \s, etc. are predefined classes. Similarly, these classes and their negations (\D, \W, \S) treat a Unicode character with a code point greater than 0xFFFF as a single character only when the u modifier is present.
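
The effect is easiest to see with a negated class such as \S. A minimal sketch, with illustrative variable names:

```javascript
const astral = "𠮷";

// Without u, \S consumes one code unit, so a single \S cannot span the whole character.
const withoutU = /^\S$/.test(astral);
console.log(withoutU); // false

// Two \S in a row are needed, one per surrogate.
const twoUnits = /^\S\S$/.test(astral);
console.log(twoUnits); // true

// With u, a single \S consumes the whole code point.
const withU = /^\S$/u.test(astral);
console.log(withU); // true
```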

  5. Standardize the writing of escape characters

    This helps you learn which characters are actually regex metacharacters, because escaping a character that does not need escaping raises an error:

        /\a/
        // /\a/
        /\a/u
        // Uncaught SyntaxError: Invalid regular expression: /\a/: Invalid escape
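
Because an invalid regex literal is a syntax error for the whole script, the behavior is easier to try out with the RegExp constructor. The helper name compiles below is hypothetical and exists only for this sketch.

```javascript
// Hypothetical helper: does the pattern compile with the given flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

// Without u, \a is tolerated and simply matches the letter a.
const looseCompiles = compiles("\\a", "");
const looseMatches = /\a/.test("a");
console.log(looseCompiles, looseMatches); // true true

// With u, the unnecessary escape is rejected.
const strictCompiles = compiles("\\a", "u");
console.log(strictCompiles); // false

// Escaping a real metacharacter is still fine with u.
const metaCompiles = compiles("\\.", "u");
console.log(metaCompiles); // true
```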
  6. Recognize non-canonical characters with the i modifier

    /[a-z]/iu.test('\u212A') // U+212A is the Kelvin sign K
    // true
    /[a-z]/iu.test('\u004B') // U+004B is the ordinary letter K
    // true
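
For contrast, the sketch below runs the same test with and without the u modifier. The variable names are illustrative.

```javascript
const kelvin = "\u212A"; // the Kelvin sign, which looks like the letter K

// Without u, case-insensitive matching does not relate the Kelvin sign to k.
const withoutU = /[a-z]/i.test(kelvin);
console.log(withoutU); // false

// With u, Unicode case folding maps the Kelvin sign to k.
const withU = /[a-z]/iu.test(kelvin);
console.log(withU); // true

// The ordinary letter K matches either way.
const plainK = /[a-z]/i.test("\u004B");
console.log(plainK); // true
```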
  7. Unicode property escapes \p{…} and \P{…}

    This newer kind of class (added in ES2018 and available only with the u modifier) matches every character that has a given Unicode property; \P{…} is its negation. There are two ways to write it:

    1. Specify the property name and value
      \p{UnicodePropertyName=UnicodePropertyValue}

      // Example:
      /\p{Script=Greek}/u  // matches a Greek character
    2. Write only the property name or property value
      \p{UnicodePropertyName}
      \p{UnicodePropertyValue}

      // Example:
      /\p{White_Space}/u // matches any whitespace character
      /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu // matches emoji
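
The following sketch tries a few property escapes against sample characters. It assumes a runtime with ES2018 property escape support; the sample strings and variable names are chosen only for illustration.

```javascript
// Property name and value.
const greek = /\p{Script=Greek}/u.test("α");
console.log(greek); // true

// \P{…} negates the property.
const notGreek = /\P{Script=Greek}/u.test("a");
console.log(notGreek); // true

// Property name only: the ideographic space U+3000 has White_Space.
const space = /\p{White_Space}/u.test("\u3000");
console.log(space); // true

// Property value only: Lu is a General_Category value (uppercase letter).
const upper = /\p{Lu}/u.test("A");
console.log(upper); // true

// Extract an emoji from surrounding text.
const emoji = "a😀b".match(/\p{Emoji_Presentation}/gu);
console.log(emoji); // ["😀"]
```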