[ES6 series] Strings and regular expressions

Better Unicode support

Prior to ES6, JavaScript strings were built on 16-bit character encoding (UTF-16), where each 16-bit sequence is a code unit. For a long time 16 bits were enough to hold any character, but once Unicode introduced its expanded character set, a single 16-bit unit could no longer represent every character, and string handling had to change.

UTF-16 code points
- The goal of Unicode is to give every character in the world a globally unique identifier, called a code point. If code units are limited to 16 bits, there are not enough values to cover all of those code points, so a character encoding must map code points onto code units in an internally consistent way
- In UTF-16, the first 2^16 code points are each represented by a single 16-bit code unit. This range is called the Basic Multilingual Plane (BMP). Code points beyond it belong to the supplementary planes and cannot be represented in 16 bits, so UTF-16 introduces surrogate pairs, which represent one code point with two 16-bit code units
- That is, a string contains two kinds of characters: BMP characters, represented by one 16-bit code unit, and supplementary-plane characters, represented by two 16-bit code units (32 bits in total)
- In ES5, all string operations work on 16-bit code units, so applying them to a UTF-16 string that contains surrogate pairs may not give the result you expect
- String properties and methods such as length and charAt() are defined in terms of 16-bit code units
- For example
```
'𠮷'.length // 2
'a'.length  // 1
```
  The character '𠮷' is represented by a surrogate pair, so length reports 2. This has the following consequences:
  1. Length checks give the wrong answer
  2. A regular expression that matches a single character fails to match it
  3. Neither of the two 16-bit code units is a printable character on its own, so charAt() does not return a valid string
  4. charCodeAt() cannot identify the character either; it only returns the value of the individual 16-bit code unit at each position
- ES6 addresses this by enforcing UTF-16 string encoding, standardizing string operations on that encoding, and adding functionality to JavaScript specifically for surrogate pairs
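
The surrogate-pair arithmetic described above can be sketched as follows. The helper name `toSurrogatePair` is hypothetical (it is not part of the language), and the sketch assumes its input is a supplementary-plane code point, that is, a value from 0x10000 to 0x10FFFF.

```javascript
// Hypothetical helper: split a supplementary-plane code point into the
// two 16-bit code units (surrogate pair) that UTF-16 uses for it.
// Assumes 0x10000 <= codePoint <= 0x10FFFF.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x20BB7); // code point of '𠮷'
console.log(high.toString(16), low.toString(16));     // d842 dfb7
console.log(String.fromCharCode(high, low) === '𠮷'); // true
```

This is the inverse of the formula used in the codePointAt() polyfill, which recombines the two code units into one code point.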

codePointAt() method

New in ES6 is the codePointAt() method, which fully supports UTF-16. It takes the position of a code unit, not of a character, and returns the code point found at that position in the string as an integer

Polyfill (source: MDN)

```
/*! http://mths.be/codepointat v0.1.0 by @mathias */
if (!String.prototype.codePointAt) {
  (function() {
    'use strict'; // strict mode, needed to support `apply`/`call` with `undefined`/`null`
    var codePointAt = function(position) {
      if (this == null) {
        throw TypeError();
      }
      var string = String(this);
      var size = string.length;
      var index = position ? Number(position) : 0;
      if (index != index) { // better `isNaN`
        index = 0;
      }
      // Bounds check
      if (index < 0 || index >= size) {
        return undefined;
      }
      var first = string.charCodeAt(index);
      var second;
      if ( // check whether this is the start of a surrogate pair
        first >= 0xD800 && first <= 0xDBFF && // high surrogate
        size > index + 1 // there is a next code unit
      ) {
        second = string.charCodeAt(index + 1);
        if (second >= 0xDC00 && second <= 0xDFFF) { // low surrogate
          // http://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
          return (first - 0xD800) * 0x400 + second - 0xDC00 + 0x10000;
        }
      }
      return first;
    };
    if (Object.defineProperty) {
      Object.defineProperty(String.prototype, 'codePointAt', {
        'value': codePointAt,
        'configurable': true,
        'writable': true
      });
    } else {
      String.prototype.codePointAt = codePointAt;
    }
  }());
}
```

For characters in the BMP character set, the codePointAt() method returns the same value as the charCodeAt() method

The return value is different for characters in the non-BMP character set, for example

```
let text = '𠮷a'
text.length         // 3
text.charCodeAt(0)  // 55362: only the first code unit at position 0
text.codePointAt(0) // 134071: the full code point, even though it spans multiple code units
```

To check how many code units a character occupies, test its code point with codePointAt()
```
function is32Bit(c){
  return c.codePointAt(0) > 0xFFFF;
}
```
How it works: the upper bound of a single 16-bit code unit is hexadecimal FFFF, so any code point above that bound must be represented by two code units
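
The same check can drive a loop that counts characters rather than code units. This is a minimal sketch; `countCodePoints` is a hypothetical helper name, and it assumes an engine that provides codePointAt().

```javascript
// Hypothetical helper: count code points by stepping over code units.
function countCodePoints(text) {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    // A code point above 0xFFFF occupies two code units,
    // so skip the second half of the surrogate pair.
    if (text.codePointAt(i) > 0xFFFF) {
      i++;
    }
    count++;
  }
  return count;
}

console.log(countCodePoints('𠮷a')); // 2
console.log('𠮷a'.length);           // 3
```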

String.fromCodePoint() method
- ECMAScript usually provides a pair of methods for an operation and its inverse:
  1. codePointAt() retrieves the code point of a character in a string
  2. String.fromCodePoint() generates a character from a given code point. It is a more complete version of String.fromCharCode() that also handles non-BMP characters
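
A short round trip shows the difference between the two methods. The variable names are only for illustration.

```javascript
const codePoint = '𠮷'.codePointAt(0); // 134071 (0x20BB7)

const viaCodePoint = String.fromCodePoint(codePoint);
// String.fromCharCode() keeps only the low 16 bits of its argument
const viaCharCode = String.fromCharCode(codePoint);

console.log(viaCodePoint === '𠮷'); // true
console.log(viaCharCode === '𠮷');  // false
console.log(viaCharCode.length);    // 1
```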
normalize() method
- In Unicode, strings made of different code points may still need to be treated as equivalent when they are sorted or compared
- There are two ways to define equivalence:
  1. Canonical equivalence: two code point sequences are interchangeable in every respect
  2. Compatibility: two compatible code point sequences may look different but can be used interchangeably in certain situations

  For example, the following two sequences represent the same character:
```
// '\u004F\u030C' ~ 79, 780
// '\u01D1' ~ 465
String.fromCharCode(79)      // "O"
String.fromCharCode(780)     // "̌" (combining caron)
String.fromCharCode(465)     // "Ǒ"
String.fromCharCode(79, 780) // "Ǒ"
```
    The sequence '\u004F\u030C' (O followed by a combining caron) and the single code point '\u01D1' (Ǒ) represent the same character and should be interchangeable
    
    However, comparing them with the normal "==" or "===" always returns false
```
'\u01D1'==='\u004F\u030C' //false
 
'\u01D1'.length // 1
'\u004F\u030C'.length // 2
```
- For equivalent sequences like these, ES6 adds the normalize() method to string instances. It converts the different representations of a character into one common form, a process known as Unicode normalization.
```
'\u01D1'.normalize() === '\u004F\u030C'.normalize()  // true
```
- The normalize() method accepts an optional parameter that specifies the normalization form. The four possible values are:
  
  NFC, the default, stands for "Normalization Form Canonical Composition" and returns the composed form of multiple simple characters under canonical equivalence. Canonical equivalence means the sequences are equivalent both visually and semantically.
  
  NFD, which stands for "Normalization Form Canonical Decomposition", returns the composed character decomposed into multiple simple characters under canonical equivalence.
  
  NFKC, which stands for "Normalization Form Compatibility Composition", returns the composed form under compatibility equivalence. Compatibility equivalence means the sequences are semantically equivalent but visually different, such as "囍" and "喜喜". (This is only an illustration; normalize() does not handle Chinese characters this way.)
  
  NFKD, which stands for "Normalization Form Compatibility Decomposition", returns the composed character decomposed into multiple simple characters under compatibility equivalence.
- The NFC form returns the composed form of a character, and the NFD form returns the decomposed form.
```
'\u004F\u030C'.normalize('NFC').length // 1
'\u004F\u030C'.normalize('NFD').length // 2
```
- The normalize method does not currently recognize compositions of three or more characters
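
A common use of normalize() is to normalize both sides before comparing or sorting. The following is a minimal sketch; `compareNormalized` is a hypothetical helper, and the ligature U+FB01 is used only as an illustration of compatibility equivalence.

```javascript
// Hypothetical comparator: normalize both strings (NFC by default)
// before comparing, so equivalent sequences compare as equal.
function compareNormalized(a, b) {
  const left = a.normalize();
  const right = b.normalize();
  if (left < right) return -1;
  if (left > right) return 1;
  return 0;
}

console.log(compareNormalized('\u01D1', '\u004F\u030C')); // 0

// Compatibility forms go further than canonical ones:
// the ligature U+FB01 is only decomposed by NFKC/NFKD.
console.log('\uFB01'.normalize('NFC') === '\uFB01'); // true
console.log('\uFB01'.normalize('NFKC') === 'fi');    // true
```

A comparator like this can be passed to Array.prototype.sort(), at the cost of normalizing each string on every comparison.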
Regular expression u modifier
- Regular expressions can perform simple string operations, but by default they treat each 16-bit code unit in the string as a character. To solve this problem, ES6 defines a Unicode u modifier for regular expressions
- When the u modifier is set, the regular expression switches from working on code units to working on characters, so a surrogate pair is no longer treated as two characters
- ES6 has no built-in property for the number of code points in a string, but with the u modifier you can count them with a regular expression
```
function codePointLength(text){
  let res = text.match(/[\s\S]/gu)
  return res ? res.length : 0
}
codePointLength('abc') // 3
codePointLength('𠮷a') // 2
```
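  This approach works, but it is slow, especially on long strings
- Checking for u modifier support: the u modifier is a syntax change, so using it in an engine that only supports ES5 throws a syntax error. Detect support before relying on it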
```
function hasRegExpU(){
  try{
    let pattern = new RegExp('.','u')
    return true
  }catch(error){
    return false
  }
}
```
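
The effect of the u modifier is easiest to see with the `.` pattern. This sketch assumes an engine that supports ES6 syntax; the last line uses the string iterator, an alternative way to count code points that is not discussed above.

```javascript
const text = '𠮷';

console.log(text.length);       // 2
console.log(/^.$/.test(text));  // false: "." matches one code unit, the string has two
console.log(/^.$/u.test(text)); // true: "." matches one code point

// The string iterator also walks code points, so spreading a string
// into an array counts characters rather than code units.
console.log([...'𠮷a'].length); // 2
```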

Other string changes

Substring recognition in a string
- The includes() method returns true if the given text is found anywhere in the string, false otherwise
- The startsWith() method returns true if the given text is found at the beginning of the string, false otherwise
- The endsWith() method returns true if the given text is found at the end of the string, false otherwise

All three methods take an optional second argument. For includes() and startsWith() it is the index at which to start the search; for endsWith() it is the length of the string to consider, so the search ends at that position. Supplying it narrows the part of the string that is searched
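
The three methods can be tried on a sample string. The string and variable name here are only for illustration.

```javascript
const msg = 'Hello world!';

console.log(msg.startsWith('Hello')); // true
console.log(msg.endsWith('!'));       // true
console.log(msg.includes('o'));       // true

console.log(msg.startsWith('o'));     // false
console.log(msg.includes('x'));       // false

// With the second argument
console.log(msg.startsWith('o', 4));  // true: the character at index 4 is "o"
console.log(msg.endsWith('o', 8));    // true: only "Hello wo" is considered
console.log(msg.includes('o', 8));    // false: "rld!" contains no "o"
```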
repeat() method
- This method takes a number specifying how many times to repeat the string, and returns a new string containing the original repeated that many times
```
'x'.repeat(3) //xxx
```
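
One typical use of repeat() is building indentation. The `indent` helper below is hypothetical and only illustrates the idea.

```javascript
// Hypothetical helper: prefix a line with `level` copies of `unit`.
function indent(line, level, unit = '  ') {
  return unit.repeat(level) + line;
}

console.log(indent('return true', 2)); // "    return true"
console.log('ab'.repeat(0) === '');    // true: zero repetitions give an empty string

// A negative count is rejected with a RangeError
try {
  'x'.repeat(-1);
} catch (error) {
  console.log(error instanceof RangeError); // true
}
```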

Other regular expression syntax changes

Regular expression y modifier
- This is the "sticky" modifier. It makes a match succeed only if it starts exactly at the position given by the regular expression's lastIndex property
- Like the g modifier, it advances through the string across successive calls, but each match must begin at the very start of the remaining string. If nothing matches there, the match fails instead of scanning ahead. For example:
```
let str = "aaa_aa_aaaa"
let reg_g = /a+/g
let reg_y = /a+/y

reg_g.exec(str)
// aaa
reg_y.exec(str)
// aaa

reg_g.exec(str)
// aa
reg_y.exec(str)
// null
```
- The example above relies on one more detail, the lastIndex property
  
  lastIndex is the position where the next match starts, and it can be set explicitly
  
  For example:
```
let str = "_aaa_aa_aaaa"
let reg_y = /a+/y
reg_y.lastIndex = 1
reg_y.exec(str)
// aaa
reg_y.lastIndex = 5
reg_y.exec(str)
// aa
reg_y.lastIndex = 8
reg_y.exec(str)
// aaaa
```
  lastIndex is read and updated by methods on the regular expression object, such as exec() and test(). In current engines, string methods such as match() delegate to the regular expression and therefore also apply the sticky behavior
- After a successful match, the y modifier stores the index just past the last matched character in lastIndex; if the match fails, lastIndex is reset to 0. Regular expressions without the g or y modifier ignore lastIndex when matching
- Check that the y modifier is available
```
function hasRegExpY(){
  try{
    let pattern = new RegExp('.','y')
    return true
  }catch(error){
    return false
  }
}
```
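
Sticky matching is a natural fit for tokenizers, where each token must start exactly where the previous one ended. The following is a minimal sketch under that assumption; the rule set and the `tokenize` function are hypothetical.

```javascript
// Hypothetical token rules, tried in order at the current position.
const rules = [
  { type: 'number', pattern: /\d+/y },
  { type: 'word', pattern: /[a-z]+/y },
  { type: 'space', pattern: /\s+/y },
];

function tokenize(input) {
  const tokens = [];
  let position = 0;
  while (position < input.length) {
    let matched = false;
    for (const { type, pattern } of rules) {
      pattern.lastIndex = position; // a sticky match must start exactly here
      const result = pattern.exec(input);
      if (result !== null) {
        if (type !== 'space') {
          tokens.push({ type, value: result[0] });
        }
        position = pattern.lastIndex; // index just past the match
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new SyntaxError(`Unexpected character at index ${position}`);
    }
  }
  return tokens;
}

console.log(tokenize('abc 123 x'));
// [ { type: 'word', value: 'abc' },
//   { type: 'number', value: '123' },
//   { type: 'word', value: 'x' } ]
```

With the g modifier instead of y, a rule could skip ahead over unrecognized characters, so invalid input would go unnoticed.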
Copying regular expressions
- In ES5, you can copy a regular expression by passing it to the RegExp constructor:
```
let re1 = /ab/i,
    re2 = new RegExp(re1)
```
- In ES5, passing a second argument with modifiers while copying a regular expression throws an error, so re2 cannot be given different modifiers. ES6 changes this: the second argument is allowed and replaces the modifiers of the copied expression
```
let re1 = /ab/i,
    re2 = new RegExp(re1, 'g')
```
The flags property
- The source property returns the text of the regular expression
- The flags property, new in ES6, returns the modifiers of the regular expression as a string
```
let re1 = /ab/i,
    re2 = new RegExp(re1, 'g')
re2.source // 'ab'
re2.flags  // 'g'
```
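
Combining source and flags makes it possible to copy a regular expression while adding a modifier rather than replacing the existing ones. The `addFlag` helper is hypothetical and assumes an engine that supports the flags property.

```javascript
// Hypothetical helper: return a copy of `regex` with `flag` added.
function addFlag(regex, flag) {
  const flags = regex.flags.includes(flag) ? regex.flags : regex.flags + flag;
  return new RegExp(regex.source, flags);
}

const re1 = /ab/i;
const re2 = addFlag(re1, 'g');

console.log(re2.source); // ab
console.log(re2.flags);  // gi
console.log(re1.flags);  // i (the original is unchanged)

// Passing flags directly to the constructor replaces them instead
console.log(new RegExp(re1, 'g').test('AB')); // false: "i" is gone
```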

Template literal

Rather than extending JavaScript's existing strings, ES6 adds template literals to fill in features that ES5 lacked:
1. Multiline string
2. Basic string formatting: The ability to embed the value of a variable into a string
3. HTML escape: The ability to insert securely converted strings into HTML

Basic syntax
- A template literal is delimited by backticks (`) instead of single or double quotes. To use a backtick inside the string, escape it with a backslash (\`). Single and double quotes do not need escaping inside a template literal
Multiline string
- All whitespace and newlines inside the backticks are significant and become part of the string
String placeholder
- A placeholder consists of a $ followed by a pair of braces, ${}, and any JavaScript expression may appear between the braces
- Template literals can be nested inside placeholders
```
`hello,${`world`}`
```
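
The basic features can be seen together in a short example. The variable names and values are only for illustration.

```javascript
const count = 3;
const price = 0.25;

// Placeholders accept any JavaScript expression
const message = `${count} items cost $${(count * price).toFixed(2)}.`;
console.log(message); // 3 items cost $0.75.

// Newlines inside the backticks are part of the string
const multiline = `line one
line two`;
console.log(multiline === 'line one\nline two'); // true

// Backticks must be escaped; single and double quotes need no escaping
console.log(`\`'"`.length); // 3
```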
Tagged templates
- A template tag performs a transformation on a template literal and returns the final string value
- A template tag is a function that processes the pieces of a template literal (loosely comparable to how map() and reduce() process arrays)
```
let tag = function(literals, ...substitutions) {
  // return a string
}
let message = tag`hello` // message is the string returned by the tag function
```
- A closer look at tag functions
```
let tag = function(literals, ...substitutions) {
  // return a string
}
let count = 10,
    price = 0.5,
    message = tag`${count} items cost $${(count * price).toFixed(2)}.`
```
  1. Tag functions usually collect the placeholder values with a rest parameter, written ...substitutions
  2. In this example, the tag function receives the following arguments:
    
    literals array: contains the elements ["", " items cost $", "."], the literal text split at the placeholders
    
    substitutions array: contains the evaluated value of each placeholder, [10, "5.00"]
  3. The first element of the literals array is an empty string, which ensures that literals[0] is always the start of the string
  4. The length of the literals array is always one more than the length of substitutions
- Using raw values in template literals
  
  Template tags also have access to the raw string information, that is, the text as written before escape sequences are converted to the characters they represent. The built-in String.raw() tag returns this raw text
  
  Inside a tag function, every literals[i] has a corresponding literals.raw[i] that holds its raw string
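
The points above can be put together in a small sketch. Both `passthru` and `firstRaw` are hypothetical tag functions written for illustration: the first interleaves the two arrays to rebuild the string a plain template literal would produce, and the second exposes the raw text of the first literal.

```javascript
// Hypothetical tag: interleave literals and substitutions.
function passthru(literals, ...substitutions) {
  let result = '';
  for (let i = 0; i < substitutions.length; i++) {
    result += literals[i];
    result += substitutions[i];
  }
  // literals always has one more element than substitutions
  result += literals[literals.length - 1];
  return result;
}

const count = 10;
const price = 0.5;
const message = passthru`${count} items cost $${(count * price).toFixed(2)}.`;
console.log(message); // 10 items cost $5.00.

// Hypothetical tag: return the raw text of the first literal.
function firstRaw(literals) {
  return literals.raw[0];
}

console.log(firstRaw`a\nb`.length);   // 4: the backslash and "n" stay separate
console.log(String.raw`a\nb`.length); // 4
console.log(`a\nb`.length);           // 3: "\n" is a single newline character
```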
