Preface

Characters are the most basic building block of the programs we write.

On the front end, the characters we meet most often are symbols, digits, English letters, and Chinese characters. We usually write them as plain literals, and only occasionally, in regular expressions and similar places, in the UTF-16 escape format. So here is a question: how many ways does JS have to represent a character?

Answer: at least 6. Using the character a as an example:

`a`        // 'a'
'a'        // 'a'
'\a'       // 'a'
'\141'     // 'a'
'\x61'     // 'a'
'\u0061'   // 'a'
'\u{0061}' // 'a'

Never looked closely at \, \x, \u, and \u{}? Don't worry, we will go through them one by one.


Note: all test code was run in the latest version of Chrome.

Summary first

| Format | Example | Code point range | Notes |
| --- | --- | --- | --- |
| \ + octal | '\141' | 0-255 | Cannot be used directly in a template string |
| \x + two hexadecimal digits | '\x61' | 0-255 | Must be exactly two digits |
| \u + four hexadecimal digits | '\u0061' | 0-65535 | Must be exactly four digits |
| \u{hexadecimal} | '\u{0061}' | 0-0x10FFFF | For code points above 0xFFFF the length is 2, and index access returns the high and low surrogates |

Encoding basics

To fully understand how characters are represented, we need a little encoding knowledge first, so let's go through it.

ASCII

ASCII defines 128 characters in total. For example, the lowercase letter a is 97 (0110 0001). These 128 characters use only the low seven bits of a byte, and the highest bit is always 0.

Of the 128 characters, 33 are control characters that cannot be displayed (0-31 and 127). The characters from 0010 0000 to 0111 1110 (32-126) can be displayed, and almost all of them can be typed on a keyboard. For the full mapping, see an ASCII table.
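As a quick check of the 32-126 range, the printable characters can be generated from their code points. This is only a small sketch; the variable name printableAscii is made up for illustration.

```javascript
// Build the 95 printable ASCII characters from their code points (32-126).
const printableAscii = Array.from({ length: 95 }, (_, i) =>
  String.fromCharCode(32 + i)
).join('');

console.log(printableAscii.length);      // 95
console.log(printableAscii[0] === ' ');  // true (32 is the space)
console.log(printableAscii.slice(-1));   // ~
```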

Extended ASCII (EASCII) uses all 8 bits of a byte and can represent 256 characters, adding some Latin-derived characters among others. See an extended ASCII table.

Unicode and code points

Unicode is a character set. To stay compatible with ASCII, Unicode assigns code points 0-127 to the same characters as ASCII. The 128-255 range, however, is different from extended ASCII.

[Figure: extended ASCII, characters 128-255]

[Figure: Unicode, characters 128-255]

Unicode assigns every character a number, usually called its code point. We can get it with the string instance methods charCodeAt and codePointAt. The former returns a single UTF-16 code unit, so it is only accurate for code points up to 0xFFFF (65535).

'𠀠'.codePointAt(0)  // 131104 (0x20020), correct
'𠀠'.charCodeAt(0)   // 55360 (0xD840), wrong: only the high surrogate
'a'.charCodeAt(0)    // 97 (0x0061), correct

Conversely, the String static methods fromCharCode and fromCodePoint turn a code point back into the corresponding character.

String.fromCodePoint(131104)  // "𠀠", correct
String.fromCharCode(131104)   // " ", wrong: the value is truncated to 16 bits (0x0020, a space)
String.fromCharCode(97)       // "a", correct
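The two directions can be combined in a small round trip. The sketch below assumes nothing beyond the built-in methods; the helper name codePointsOf is made up for illustration.

```javascript
// Hypothetical helper: list the code points of a string.
// for...of iterates by code point, so a surrogate pair stays together.
function codePointsOf(str) {
  const points = [];
  for (const ch of str) {
    points.push(ch.codePointAt(0));
  }
  return points;
}

const points = codePointsOf('a𠀠');
console.log(points);                           // [ 97, 131104 ]
console.log(String.fromCodePoint(...points));  // a𠀠
```

Iterating with for...of rather than by index is the design choice that matters here: an index loop would visit the two halves of 𠀠 separately.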

UTF-8, UTF-16

UTF-8 and UTF-16 are both encodings of the Unicode character set. JS strings are stored and represented in UTF-16.

UTF-16 uses two bytes (one code unit) for code points up to 0xFFFF, and four bytes (two code units) for code points above 0xFFFF. This shows up in the length of a string.

"𠀠".length  // 2, code point 131104 (0x20020) > 65535 (0xFFFF)
"a".length   // 1
"人".length  // 1

To stress it once more: 0xFFFF is the dividing line. This is important.

Base conversion

The number instance method toString(radix) converts a decimal number to the given base.

97..toString(16) // '61'
97..toString(2)  // '1100001'
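For completeness, here is a small sketch of the round trip, using parseInt with an explicit radix to go back to decimal. The variable names are only for illustration. The double dot in 97..toString(16) is needed because the first dot is read as part of the number literal; (97).toString(16) is an equivalent spelling.

```javascript
const codeA = 'a'.charCodeAt(0);       // 97

// Number to string in another base.
const hexA = codeA.toString(16);       // '61'
const octA = codeA.toString(8);        // '141'
const binA = codeA.toString(2);        // '1100001'

// And back again: parseInt takes the base as its second argument.
console.log(parseInt(hexA, 16));       // 97
console.log(parseInt(octA, 8));        // 97
console.log(parseInt(binA, 2));        // 97

// Parentheses work instead of the double dot.
console.log((97).toString(16) === hexA); // true
```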

Now let's look at each way of representing a character.

\ + character

\ is the escape character. In front of most characters it has no effect; it only does something in front of certain special characters.

For example, \a has no effect and is just a, whereas \r does have an effect: it is a carriage return.
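The difference between the two is easy to verify. A minimal sketch, with variable names chosen only for illustration:

```javascript
// \a is not a recognised escape, so the backslash is simply dropped.
const plain = '\a';
console.log(plain === 'a');      // true
console.log(plain.length);       // 1

// \r is a recognised escape: one character, the carriage return (code point 13).
const cr = '\r';
console.log(cr.length);          // 1
console.log(cr.charCodeAt(0));   // 13
console.log(cr === 'r');         // false
```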

For more information on escaping characters, see Escaping Characters – Wiki.

\ + octal

It can represent code points from 0 to 255. Octal escapes are a legacy syntax and are a SyntaxError in strict mode.

One question to keep in mind: is the character shown here an ASCII character or a Unicode character?

  1. Take the character a. Its code point is 97, which charCodeAt returns:
'a'.charCodeAt(0) // 97
  2. Convert it to base 8: 97..toString(8) is '141'.
  3. Write it as \141:
console.log('\141')  // a

Now some special code points. The characters at code points 31 and 127 are control characters and cannot be displayed:

// 37 = 31..toString(8)
'\37'   // '\x1F'

// 177 = 127..toString(8)
'\177'  // '\x7F'

If you are wondering why \177 comes back as \x7F, the reason is simple. When the console sees that the value is outside the displayable range, it shows the code point itself instead, converted to hexadecimal and written in the two-digit \x format.

// 177 = 127..toString(8)
'\177'  // '\x7F'
127..toString(16) // '7f'

About the upper limit of the representable code point (255):

'\377'  // 'ÿ', code point 255
'\400'  // ' 0', above 255, so it is read as '\40' + '0'
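The way '\400' splits into '\40' + '0' follows from how many digits an octal escape may consume. The sketch below imitates that rule with a regular expression; readOctalEscape is a hypothetical helper, not an engine API, and it takes the digits as plain text so that it also runs in strict mode.

```javascript
// Hypothetical helper that mimics how a legacy octal escape is read.
// `digits` is the text that follows the backslash, e.g. '400' for '\400'.
// Returns the decoded character plus whatever was left unconsumed.
function readOctalEscape(digits) {
  // 0-3 may be followed by up to two more octal digits (max 377 = 255);
  // 4-7 may be followed by only one more, so the value never exceeds 255.
  const match = /^(?:[0-3][0-7]{0,2}|[4-7][0-7]?)/.exec(digits);
  if (match === null) {
    throw new RangeError('not an octal escape: ' + digits);
  }
  const consumed = match[0];
  return {
    char: String.fromCharCode(parseInt(consumed, 8)),
    rest: digits.slice(consumed.length),
  };
}

console.log(readOctalEscape('141')); // { char: 'a', rest: '' }
console.log(readOctalEscape('377')); // { char: 'ÿ', rest: '' }
console.log(readOctalEscape('400')); // { char: ' ', rest: '0' }
```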

To answer the question from the beginning: what we see are definitely Unicode characters, because, as mentioned above, JS strings use UTF-16. To check, pick a code point in the 128-255 range, where the two differ, for example 254:

Extended ASCII 254: [figure: character 254 in the extended ASCII table]

Unicode 254: 'þ'

// 376 = 254..toString(8)
'\376'  // 'þ'

So it’s important to understand that all of our character representations represent Unicode characters.

Next up is the two-digit hexadecimal format, \x.

\x + two hexadecimal digits

0x marks a hexadecimal number, so it is easy to guess that \x means hexadecimal as well.

0x61               // 97, 0x marks a hexadecimal number
'a'.charCodeAt(0)  // 97, the code point
97..toString(16)   // '61', converted to hexadecimal
'\x61'             // 'a'

Two hexadecimal digits cover code points 0x00 to 0xFF (0-255). As with the \ octal format, a character that cannot be displayed is shown as its escape sequence instead.

// 1f = 31..toString(16)
'\x1F' // '\x1F'

// 20 = 32..toString(16)
'\x20' // ' '

// 7e = 126..toString(16)
'\x7e' // '~'

// 7f = 127..toString(16)
'\x7f' // '\x7F'

// 80 = 128..toString(16)
'\x80' // '\x80'

At this point you may ask: with this one not displayable and that one not displayable, is there a table somewhere? There is: see a Unicode code chart, which lists the characters for code points 0x0000 to 0xFFFF. That is enough for most purposes.

Within the 0-255 range, 0x00-0x1F (0-31) and 0x80-0x9F (128-159) are control characters and cannot be displayed, and neither can 0x7F (127), as seen above.

'\x9F'  // '\x9F', shown as its escape
'\xA1'  // '¡', displayed normally
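The displayable and non-displayable ranges can be written down as a small function. This is a sketch of the rule described above, not a reproduction of Chrome's actual formatter; showLatin1 is a hypothetical name.

```javascript
// Hypothetical helper: how a console might show a code point in 0-255.
// Control characters (0x00-0x1F and 0x7F-0x9F) come back as a \x escape,
// everything else as the character itself.
function showLatin1(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0xff) {
    throw new RangeError('expected an integer in 0-255');
  }
  const isControl =
    codePoint <= 0x1f || (codePoint >= 0x7f && codePoint <= 0x9f);
  if (isControl) {
    return '\\x' + codePoint.toString(16).toUpperCase().padStart(2, '0');
  }
  return String.fromCharCode(codePoint);
}

console.log(showLatin1(0x1f)); // \x1F
console.log(showLatin1(0x7e)); // ~
console.log(showLatin1(0x9f)); // \x9F
console.log(showLatin1(0xa1)); // ¡
```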

The output may differ from browser to browser; the results here were verified in the latest version of Chrome.

The output varies, but the meaning is the same everywhere: this code point cannot be displayed as a character.

[Figures: console output in 360 Browser, Firefox, and Chrome]

\u + four hexadecimal digits

It has to be exactly four digits, not one fewer.

"\u0061" // "a"
"\u061"  // SyntaxError

And not one more: if there are extra digits, the first four are taken as the escape and the rest are appended as ordinary characters. An example makes it clear.

'\u0061'   // 'a'
'\u00610'  // 'a0'
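Going the other way, the four-digit rule means the hexadecimal value has to be zero-padded when an escape is generated. A minimal sketch, with toUEscapes as a made-up helper name:

```javascript
// Hypothetical helper: write every UTF-16 code unit as \uXXXX.
// padStart keeps the hexadecimal part at exactly four digits.
function toUEscapes(str) {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    out += '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');
  }
  return out;
}

console.log(toUEscapes('a'));   // \u0061
console.log(toUEscapes('a0'));  // \u0061\u0030
console.log(toUEscapes('人'));  // \u4EBA
```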

So how do we write a character whose code point is above 0xFFFF and needs more than four hexadecimal digits?

ES6 took this into account and introduced \u{ + hexadecimal + }, covered in the next section.

As said earlier, UTF-16 is one encoding of Unicode. Unicode reserves the surrogate range 0xD800-0xDFFF, whose code points do not represent any character on their own. If we write a code point from this range in the \u + four hexadecimal digits format, we get back � or the escape itself, depending on the browser. Other code points may of course also be unassigned or non-printable.

'\uD800'  // '\uD800'
'\uDFFF'  // '\uDFFF'

In fact, UTF-16 uses the surrogate range to split a character whose code point is above 0xFFFF into a high part and a low part. Index 0 returns the high surrogate and index 1 returns the low surrogate.

var text = "𠀠";
text[0]  // '\uD840'
text[1]  // '\uDC20'
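The split into high and low parts is plain arithmetic: subtract 0x10000, then put the top 10 bits into the high surrogate and the bottom 10 bits into the low one. A short sketch, with toSurrogatePair as an illustrative name:

```javascript
// Split a code point above 0xFFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('code point must be in 0x10000-0x10FFFF');
  }
  const offset = codePoint - 0x10000;        // 20 significant bits
  const high = 0xd800 + (offset >> 10);      // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);     // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x20020);
console.log(high.toString(16), low.toString(16));       // d840 dc20
console.log(String.fromCharCode(high, low) === '𠀠');   // true
```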

UTF-16 encoding will be covered in more detail in a follow-up.

\u{ + hexadecimal + }

New in ES6. The hexadecimal digits are wrapped in {}. This is the most uniform format: it can represent code points both below and above 0xFFFF.

"\u{20020}"  // '𠀠'
"\u{0061}"   // 'a'
"\u{061}"    // 'a'
"\u{61}"     // 'a'
"\u{9}"      // "\t"
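Because \u{} works per code point rather than per code unit, generating it is simpler than generating \uXXXX: no padding and no surrogate handling. A minimal sketch, with toBraceEscapes as a hypothetical helper:

```javascript
// Hypothetical helper: write every code point as \u{...}.
// for...of walks the string by code point, so 𠀠 yields one escape, not two.
function toBraceEscapes(str) {
  let out = '';
  for (const ch of str) {
    out += '\\u{' + ch.codePointAt(0).toString(16) + '}';
  }
  return out;
}

console.log(toBraceEscapes('a𠀠'));             // \u{61}\u{20020}

// Both spellings produce the same two code units.
console.log('\u{20020}' === '\uD840\uDC20');    // true
```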

And there is no four-digit requirement, which is really convenient.

The downside is compatibility, and time will take care of that.

ES6 template strings

ES6 template strings are great, so let's count them as one of the newer ways to write characters.

The \u, \u{}, and \x formats work inside them too:

// 61 = "a".charCodeAt(0).toString(16)
`I am \u{61}`  // 'I am a'
`I am \x61`    // 'I am a'
`I am \u0061`  // 'I am a'
`I am \a`      // 'I am a'

Notice anything missing? Right: octal escapes are not allowed.

// 141 = "a".charCodeAt(0).toString(8)
`I am \141`    // SyntaxError

If you really must, go through ${''}:

`I am ${'\141'}`   // 'I am a'
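One caveat with that workaround: legacy octal escapes in ordinary string literals are also rejected in strict mode (which includes ES modules), so it only works in sloppy-mode code. A sketch of an alternative that works everywhere, assuming only built-ins, is to go through an octal number literal instead of an octal escape:

```javascript
// 0o141 is an ES6 octal number literal (not an octal escape), equal to 97.
const viaNumber = `I am ${String.fromCharCode(0o141)}`;

// The hexadecimal escapes remain available inside the template itself.
const viaHex = `I am \x61`;

console.log(viaNumber);             // I am a
console.log(viaNumber === viaHex);  // true
```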

Practical applications

Regular expression for matching Chinese

The regular expression most commonly used to match Chinese characters is [\u4e00-\u9fa5]:

// No g flag: with g, test() keeps lastIndex between calls and the results depend on call order.
var regZH = /[\u4e00-\u9fa5]/;
regZH.test("a");   // false
regZH.test("人");  // true
regZH.test("𠀠");  // false, unfortunately

Note that this only recognizes common Chinese characters. After all, the code point range it covers is only that large.
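To cover characters outside that range, one option is a Unicode property escape. The sketch below assumes an engine that supports \p{...} with the u flag (ES2018 and later); it is one possible approach, not the method used in the original.

```javascript
// \p{Script=Han} needs the u flag; it covers Han characters in every plane.
const hanRe = /\p{Script=Han}/u;

console.log(hanRe.test('a'));   // false
console.log(hanRe.test('人'));  // true
console.log(hanRe.test('𠀠'));  // true
```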

Trimming whitespace

Here is the String.prototype.trim polyfill from MDN:

if (!String.prototype.trim) {
  String.prototype.trim = function () {
    return this.replace(/^[\s\uFEFF\xA0]+|[\s\uFEFF\xA0]+$/g, '');
  };
}

Now look at how the well-known core-js defines the whitespace characters for trim:

module.exports = '\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002' +
  '\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF';

Let's print it:

'\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002' +
  '\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF'
// The output starts with the escapes for tab, line feed, vertical tab, form feed, and carriage return; the remaining characters all render as blank space.
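Since printing invisible characters is not very informative, a more readable check is to list their code points and confirm that the native trim treats all of them as whitespace. A small sketch; the variable names are only for illustration.

```javascript
const whitespaces =
  '\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002' +
  '\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF';

// Show each character as U+XXXX instead of printing it.
const labels = Array.from(whitespaces, (ch) =>
  'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
);
console.log(labels.length);                 // 25
console.log(labels.slice(0, 3).join(' '));  // U+0009 U+000A U+000B

// The native trim and \s treat all of them as whitespace.
console.log((whitespaces + 'a' + whitespaces).trim() === 'a'); // true
console.log(/^\s+$/.test(whitespaces));                        // true
```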

CSS content property and CSS icon fonts

Modern browsers can handle Chinese characters written directly in content, but the hexadecimal escape is still recommended. The format is \ + hexadecimal, with neither u nor {}. Characters with code points above 0xFFFF are supported.

    div.me::before {
        content: "\6211"; /* 我 */
        padding-right: 10px;
    }

    div.me::before {
        content: "\20020"; /* 𠀠 */
        padding-right: 10px;
    }
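The escape for a given character can be produced from JS. The sketch below uses a hypothetical helper, toCssEscape, and pads every escape to six hexadecimal digits; a CSS escape takes at most six digits, so a padded escape cannot accidentally swallow a following character that happens to be a hexadecimal digit.

```javascript
// Hypothetical helper: turn each character into a six-digit CSS escape.
function toCssEscape(str) {
  let out = '';
  for (const ch of str) {
    out += '\\' + ch.codePointAt(0).toString(16).toUpperCase().padStart(6, '0');
  }
  return out;
}

console.log(toCssEscape('我'));  // \006211
console.log(toCssEscape('𠀠'));  // \020020
```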

CSS color values

Font colors, background colors, border colors, and so on can all be written as six hexadecimal digits, and there is also a three-digit shorthand.

.title {
    color: #FFF
}
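To see how the shorthand relates to the six-digit form, here is a sketch that expands #RGB to #RRGGBB and reads out the three channels with parseInt. parseHexColor is a made-up helper and only handles these two forms.

```javascript
// Hypothetical helper: parse #RGB or #RRGGBB into numeric channels.
function parseHexColor(color) {
  const m = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i.exec(color);
  if (m === null) {
    throw new SyntaxError('expected #RGB or #RRGGBB: ' + color);
  }
  let hex = m[1];
  if (hex.length === 3) {
    // Shorthand: every digit is doubled, so F becomes FF.
    hex = hex.replace(/./g, (d) => d + d);
  }
  return {
    r: parseInt(hex.slice(0, 2), 16),
    g: parseInt(hex.slice(2, 4), 16),
    b: parseInt(hex.slice(4, 6), 16),
  };
}

console.log(parseHexColor('#FFF'));    // { r: 255, g: 255, b: 255 }
console.log(parseHexColor('#ff8000')); // { r: 255, g: 128, b: 0 }
```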

Counting characters

This relies on a property of UTF-16: \uD800-\uDFFF is the surrogate range. The details of UTF-16 encoding are explained separately; see also the article on why � � appears for a full explanation.

const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
// Replace every surrogate pair with one placeholder, then take the length.
function chCounts(str) {
  return str.replace(spRegexp, '_').length;
}

chCounts("𠀠")  // 1
"𠀠".length     // 2

ES6 offers a few more ways:

Array.from("𠀠").length // 1
[..."𠀠"].length        // 1
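Counting by code point also makes it possible to cut a string without splitting a surrogate pair in half. A minimal sketch, with takeChars as a hypothetical helper:

```javascript
// Hypothetical helper: keep the first n characters, counted by code point.
function takeChars(str, n) {
  return Array.from(str).slice(0, n).join('');
}

const sample = '𠀠a人';
console.log(Array.from(sample).length);        // 3
console.log(takeChars(sample, 1));             // 𠀠
console.log(takeChars(sample, 2));             // 𠀠a

// A plain slice works on code units and cuts the pair in half.
console.log(sample.slice(0, 1) === '\uD840');  // true
```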

File type identification

How to check the type of a file?

const isPNG = check([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]); // magic number of a PNG image
const realFileElement = document.querySelector("#realFileType");

// check and readBuffer are helper functions defined elsewhere.
async function handleChange(event) {
  const file = event.target.files[0];
  const buffers = await readBuffer(file, 0, 8); // first 8 bytes of the file
  const uint8Array = new Uint8Array(buffers);
  realFileElement.innerText = `The file type of ${file.name} is: ${
    isPNG(uint8Array) ? "image/png" : file.type
  }`;
}
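The snippet above relies on two helpers, check and readBuffer, whose definitions are not part of this article. The following is one possible sketch of them, not the original implementation. It assumes an environment where Blob.prototype.arrayBuffer() is available, and the sample byte arrays are made up for the demonstration.

```javascript
// Hypothetical helper: build a checker for a fixed byte signature.
function check(signature) {
  return (bytes, offset = 0) =>
    signature.every((byte, i) => bytes[offset + i] === byte);
}

// Hypothetical helper: read bytes [start, end) of a File or Blob.
// Assumes Blob.prototype.arrayBuffer() exists in the target environment.
function readBuffer(file, start = 0, end = 2) {
  return file.slice(start, end).arrayBuffer();
}

const isPNG = check([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Made-up sample data: a PNG signature plus one byte, and a JPEG-like header.
const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0x00]);
const jpegHeader = new Uint8Array([0xff, 0xd8, 0xff, 0xe0, 0, 0, 0, 0]);

console.log(isPNG(pngHeader));   // true
console.log(isPNG(jpegHeader));  // false
```

Comparing the leading bytes rather than trusting file.type matters because file.type is derived from the file extension, which a renamed file can get wrong.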

Other

  • Emoji icons
  • Encoding conversion, such as UTF-8 to Base64
  • And so on.

Final words

Stay true to your original intentions, and may you gain something without getting weary. If you found this useful, your likes and comments are my biggest motivation to keep going.

