Preface
Characters are the most basic building block of the programs we write.
On the front end, the characters we use most (symbols, digits, English, Chinese) are normally written as literals; occasionally, in regular expressions and similar places, we use the UTF-16 code unit escape format. So here is the question: do you know how many ways there are to represent a character in JS?
Answer: at least 6. Using the character a as an example:
`a` // 'a'
'a' // 'a'
'\a' // 'a'
'\141' // 'a'
'\x61' // 'a'
'\u0061' // 'a'
'\u{0061}' // 'a'
If \, \x, \u, and \u{} look unfamiliar, don't worry. We'll go through them all.
Note: all test code was run in the latest version of Chrome.
Summary first
Format | Example | Code point range | Notes |
---|---|---|---|
\ + octal | '\141' | 0-255 | Cannot be used directly in a template string |
\x + two hexadecimal digits | '\x61' | 0-255 | Must be exactly two digits |
\u + four hexadecimal digits | '\u0061' | 0-65535 | Must be exactly four digits |
\u{hexadecimal} | '\u{0061}' | 0-0x10FFFF | If the code point is greater than 0xFFFF, the length is 2, and index access returns the high and low surrogates |
Encoding basics
To fully understand character representation, you need some basic encoding knowledge, so let's go through it.
ASCII
ASCII defines 128 characters in total; the letter a, for example, is 97 (0110 0001). These 128 characters use only the lower seven bits of an 8-bit byte, and the leading bit is always 0.
Of these 128 characters, 33 cannot be displayed (0-31 and 127). 0010 0000 ~ 0111 1110 (32-126) can be displayed, and nearly all of them can be typed on a keyboard. For the full mapping, see: ASCII coding comparison table.
Extended ASCII (EASCII) uses all 8 bits of a byte to represent 256 characters, including some derived Latin characters. See the extended ASCII table.
Unicode and code points
Unicode is a character set. To stay compatible with ASCII, Unicode keeps code points 0-127 identical to ASCII; the 128-255 range is where it differs from extended ASCII.
Here is 128-255 in extended ASCII:
And here is 128-255 in Unicode:
Unicode assigns each character a number, usually called its code point. We can read it with the string instance methods charCodeAt and codePointAt. The former is only accurate for code points up to 0xFFFF (65535).
'𠀠'.codePointAt(0) // 131104 (0x20020), correct
'𠀠'.charCodeAt(0) // 55360 (0xD840), wrong: only the high surrogate
'a'.charCodeAt(0) // 97 (0x0061), correct
Correspondingly, the String static methods fromCharCode and fromCodePoint turn a code point back into a character.
String.fromCodePoint(131104) // "𠀠", correct
String.fromCharCode(131104) // " ", wrong: the value is truncated to 16 bits (0x0020, a space)
String.fromCharCode(97) // "a", correct
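As a quick check of the two pairs of methods, the sketch below round-trips a character through its code point. The helper name `roundTrip` is made up for this illustration and is not part of any API.

```javascript
// Hypothetical helper: read a character's code point and build the character back.
function roundTrip(ch) {
  const codePoint = ch.codePointAt(0); // full code point, even above 0xFFFF
  const codeUnit = ch.charCodeAt(0);   // first UTF-16 code unit only
  return {
    codePoint,
    codeUnit,
    fromCodePoint: String.fromCodePoint(codePoint),
    fromCharCode: String.fromCharCode(codePoint), // silently truncated to 16 bits
  };
}

const small = roundTrip("a");
const big = roundTrip("𠀠");
console.log(small.fromCharCode === "a"); // true
console.log(big.fromCodePoint === "𠀠"); // true
console.log(big.fromCharCode === " ");   // true: 0x20020 & 0xFFFF is 0x0020
```

The takeaway is that the CodePoint variants are the safe default whenever the input may contain characters above 0xFFFF.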
UTF-8, UTF-16
UTF-8 and UTF-16 are both encodings of the Unicode character set. JS strings are stored and represented in UTF-16.
UTF-16 uses two bytes (one code unit) for code points up to 0xFFFF, and four bytes (two code units) for code points greater than 0xFFFF. This shows up in the length of a string.
"𠀠".length // 2, code point 131104 (0x20020) > 65535 (0xFFFF)
"a".length // 1
"人".length // 1
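To make the difference between the two encodings concrete, here is a small sketch that reports the size of a string in UTF-16 code units, UTF-16 bytes, and UTF-8 bytes. It assumes an environment with a global `TextEncoder` (modern browsers and recent Node.js versions); `encodedSizes` is a name invented for this example.

```javascript
// Hypothetical helper: size of a string under UTF-16 and UTF-8.
function encodedSizes(str) {
  return {
    utf16Units: str.length,                          // length counts UTF-16 code units
    utf16Bytes: str.length * 2,                      // each code unit is two bytes
    utf8Bytes: new TextEncoder().encode(str).length, // TextEncoder always produces UTF-8
  };
}

for (const ch of ["a", "人", "𠀠"]) {
  console.log(ch, encodedSizes(ch));
}
```

For "a", "人", and "𠀠" the UTF-8 sizes are 1, 3, and 4 bytes, while the UTF-16 sizes are 2, 2, and 4 bytes.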
To emphasize: 0xFFFF is the dividing line, and it matters.
Base conversion
We can call toString() on a number to convert it from base 10 to another base. (The double dot is needed because the first dot is parsed as a decimal point.)
97..toString(16) // '61'
97..toString(2) // '1100001'
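toString(radix) combined with padStart is enough to produce the digits that each escape format in the summary table expects. The sketch below builds the source text of every escape form for one character; `escapeForms` is a hypothetical helper written for this article, not a built-in.

```javascript
// Hypothetical helper: source text of each escape form for a character.
function escapeForms(ch) {
  const cp = ch.codePointAt(0);
  const hex = cp.toString(16);
  return {
    octal: cp <= 0xff ? "\\" + cp.toString(8) : null,             // e.g. \141
    hex2: cp <= 0xff ? "\\x" + hex.padStart(2, "0") : null,       // e.g. \x61
    unicode4: cp <= 0xffff ? "\\u" + hex.padStart(4, "0") : null, // e.g. \u0061
    unicodeBraces: "\\u{" + hex + "}",                            // e.g. \u{61}
  };
}

console.log(escapeForms("a"));
console.log(escapeForms("𠀠")); // only the \u{...} form is available

// parseInt goes the other way, from a digit string back to a number.
console.log(parseInt("141", 8), parseInt("61", 16)); // 97 97
```

A form is reported as null when the code point is outside the range that form can express.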
Now let's look at each way of representing a character.
\ + character
\ is the escape character. In front of most characters it has no effect; it only does something with certain special characters.
For example, \ has no effect on a ('\a' is just 'a'), but it does on r ('\r' is a carriage return).
For more on escape characters, see Escaping Characters – Wiki.
\ + octal
It can represent code points from 0 to 255.
A question to keep in mind: are the characters shown here ASCII characters or Unicode characters?
- Take the character a. Its code point is 97, which can be obtained with charCodeAt:
'a'.charCodeAt(0) // 97
- Convert it to base 8: 97..toString(8) is '141', so the escape is \141.
console.log('\141') // a
Now some special code points: the characters at code points 31 and 127 cannot be displayed.
// 37 = 31..toString(8)
'\37' // '\x1F'
// 177 = 127..toString(8)
'\177' // '\x7F'
If you are wondering why \177 turned into \x7F, the reason is simple. When the console finds that the character is not displayable, it works back to the numeric value, converts it to hexadecimal, and prints it in the two-digit \x format.
// 177 = 127..toString(8)
'\177' // '\x7F'
127..toString(16) // '7f'
About the upper limit of the representable code point (255):
'\377' // 'ÿ', code point 255
'\400' // greater than 255, so it is read as '\40' + '0', that is ' 0'
To answer the earlier question: what we get must be Unicode characters, because, as mentioned above, JS strings use UTF-16. To check, take a code point in the 128-255 range, where the two tables differ, for example 254:
Extended ASCII 254:
Unicode 254: 'þ'
// 376 = 254..toString(8)
'\376' // 'þ'
So it is important to understand that every one of these representations denotes a Unicode character.
Next is the two-digit hexadecimal format, \x.
\x + two hexadecimal digits
0x already denotes a hexadecimal number, so \x is easy to remember: x stands for hexadecimal.
0x61 // 97, 0x marks a hexadecimal number
'a'.charCodeAt(0) // 97, the code point
97..toString(16) // '61', converted to hexadecimal
'\x61' // 'a'
Two hexadecimal digits cover code points 0x00 to 0xFF (0 to 255). As with \ + octal, characters that cannot be displayed are shown as their escape sequence.
// 1f = 31..toString(16)
'\x1F' // '\x1F'
// 20 = 32..toString(16)
'\x20' // ' '
// 7e = 126..toString(16)
'\x7e' // '~'
// 7f = 127..toString(16)
'\x7f' // '\x7F'
// 80 = 128..toString(16)
'\x80' // '\x80'
At this point you might ask: with this one not displayable and that one not displayable, is there a table? Yes. See the Unicode code table, which lists the characters from code point 0x0000 to 0xFFFF; that is generally enough.
Within the 0-255 range, 0x00 to 0x1F (0-31) and 0x80 to 0x9F (128-159) are invisible or cannot be displayed.
'\x9F' // '\x9F', printed as an escape
'\xA1' // '¡', displayed normally
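Rather than memorizing these ranges, one option is to ask the regular expression engine. The sketch below assumes support for Unicode property escapes (ES2018) and lists every code point in 0-255 that belongs to the Unicode "Control" category (`\p{Cc}`). Whether a given console prints such a character as an escape is up to the console, so treat this as an approximation of the behavior shown above.

```javascript
// \p{Cc} is the Unicode "Control" general category; it requires the u flag.
const controlChar = /\p{Cc}/u;

const controls = [];
for (let cp = 0; cp <= 0xff; cp++) {
  if (controlChar.test(String.fromCodePoint(cp))) controls.push(cp);
}

// Collapse the list into [start, end] ranges for readability.
const ranges = [];
for (const cp of controls) {
  const last = ranges[ranges.length - 1];
  if (last && last[1] === cp - 1) last[1] = cp;
  else ranges.push([cp, cp]);
}

console.log(ranges.map(([a, b]) => `0x${a.toString(16)}-0x${b.toString(16)}`));
// [ '0x0-0x1f', '0x7f-0x9f' ]
```

The second range starts at 0x7F, which matches the '\x7f' example earlier in this section.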
The output may not be the same in every browser; verify with the latest version of Chrome.
The outputs differ in detail, but they all mean the same thing: this code point does not stand for a displayable character.
Output in 360 Browser, Firefox, and Chrome: (screenshots omitted)
\u + four hexadecimal digits
Exactly four digits are required, not one fewer.
"\u0061" // "a"
"\u061" // SyntaxError
Still four digits: if there are more, the first four are taken as the escape and the rest are appended as ordinary characters. An example makes this clear.
'\u0061' // 'a'
'\u00610' // 'a0'
What if the code point is greater than 0xFFFF and needs more than four hexadecimal digits?
ES6 takes this into account and introduces \u{ + hexadecimal + }, covered in the next section.
As said earlier, UTF-16 is an encoding of Unicode. Unicode reserves the surrogate range 0xD800-0xDFFF, which does not represent any character. If we use \u + four hexadecimal digits with a code point in this range, we get back � or the escape itself (depending on the browser). Of course, other code points may also be unassigned or not printable.
'\uD800' // '\uD800'
'\uDFFF' // '\uDFFF'
In fact, UTF-16 uses the surrogate range to split a character whose code point is greater than 0xFFFF into a high part and a low part. Index 0 returns the high surrogate, and index 1 returns the low surrogate.
var text = "𠀠";
text[0] // '\uD840'
text[1] // '\uDC20'
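The split into high and low parts follows the standard UTF-16 arithmetic: subtract 0x10000, then put the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate. A minimal sketch follows; the function names are chosen for this example.

```javascript
// Split a code point above 0xFFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xffff || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point in 0x10000-0x10FFFF");
  }
  const offset = codePoint - 0x10000;    // 20 significant bits remain
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

// Inverse: combine a high and a low surrogate into a code point.
function fromSurrogatePair(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x20020);
console.log(high.toString(16), low.toString(16));     // d840 dc20
console.log(String.fromCharCode(high, low) === "𠀠"); // true
console.log(fromSurrogatePair(high, low) === 0x20020); // true
```

The values d840 and dc20 are the same ones that text[0] and text[1] return in the snippet above.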
More on UTF-16 encoding will follow in a later article.
\u{ + hexadecimal + }
This is new in ES6: the digits are wrapped in {}. It is the unified form, able to represent characters with code points below 0xFFFF as well as those above 0xFFFF.
"\u{20020}" // '𠀠'
"\u{0061}" // 'a'
"\u{061}" // 'a'
"\u{61}" // 'a'
"\u{9}" // '\t'
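Because \u{} covers the whole code point range, it is a convenient target format when a string has to be written in ASCII only. The sketch below is one way to do that; `escapeNonAscii` is a hypothetical helper, not something the language provides.

```javascript
// Hypothetical helper: rewrite every non-ASCII character as a \u{...} escape.
function escapeNonAscii(str) {
  let out = "";
  for (const ch of str) { // for...of walks code points, not code units
    const cp = ch.codePointAt(0);
    out += cp > 0x7f ? `\\u{${cp.toString(16)}}` : ch;
  }
  return out;
}

console.log(escapeNonAscii("a人𠀠"));        // a\u{4eba}\u{20020}
console.log("\u{20020}" === "\uD840\uDC20"); // true: the same two code units
```

The second line shows that \u{20020} is only a different spelling: the resulting string still holds a surrogate pair and has length 2.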
And there is no four-digit requirement, which is a real convenience.
The drawback is compatibility, which time will smooth out.
ES6 template strings
ES6 template strings are very handy, so let's count them as one more way to represent characters.
The \u, \u{}, and \x formats also work inside them:
// 61 = "a".charCodeAt(0).toString(16)
`I am \u{61}` // 'I am a'
`I am \x61` // 'I am a'
`I am \u0061` // 'I am a'
`I am \a` // 'I am a'
Notice anything missing? That's right: octal escapes are not allowed.
// 141 = "a".charCodeAt(0).toString(8)
`I am \141` // SyntaxError
If you really must, wrap it in ${''}:
`I am ${'\141'}` // 'I am a'
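The restriction can be checked without crashing the page by compiling small pieces of source text. The sketch below uses `new Function` purely for demonstration (it may be blocked by a Content Security Policy in some pages); `tryCompile` is a made-up helper. It also covers two related cases not discussed above: string literals in strict mode, and tagged templates such as String.raw.

```javascript
// Hypothetical helper: compile and run a snippet, reporting the result or the error name.
function tryCompile(source) {
  try {
    return { ok: true, value: new Function(source)() };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

console.log(tryCompile("return '\\141'"));               // ok, value 'a' (non-strict code)
console.log(tryCompile("'use strict'; return '\\141'")); // SyntaxError in strict mode
console.log(tryCompile("return `\\141`"));               // SyntaxError in a template literal
console.log(tryCompile("return `${'\\141'}`"));          // ok, value 'a'
console.log(tryCompile("return String.raw`\\141`"));     // ok, the raw text \141
```

Since ES modules and class bodies are always strict, the octal form is best treated as legacy syntax; \x and \u work everywhere.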
Practical applications
Matching Chinese with a regular expression
The usual pattern is [\u4e00-\u9fa5]:
var regZH = /[\u4e00-\u9fa5]/; // no g flag: test() on a global regex keeps state in lastIndex
regZH.test("a"); // false
regZH.test("人"); // true
regZH.test("𠀠"); // false, awkward
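If the target environments support Unicode property escapes (ES2018), one alternative is to match by script instead of by a fixed range. This is a sketch of that option, not a drop-in replacement: `\p{Script=Han}` matches every Han character, which also includes characters used mainly in Japanese or Korean text.

```javascript
const basicRange = /[\u4e00-\u9fa5]/; // the fixed range, without the g flag
const hanScript = /\p{Script=Han}/u;  // property escape, requires the u flag

for (const ch of ["a", "人", "𠀠"]) {
  console.log(ch, basicRange.test(ch), hanScript.test(ch));
}
// a false false
// 人 true true
// 𠀠 false true
```

Both patterns are written without the g flag, because test() on a global regex advances lastIndex between calls and can give surprising results when the same regex is reused.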
Note that this only recognizes common Chinese characters; after all, the code point range is only that large.
Removing whitespace
Look at the String.prototype.trim polyfill on MDN:
if (!String.prototype.trim) {
  String.prototype.trim = function () {
    return this.replace(/^[\s\uFEFF\xA0]+|[\s\uFEFF\xA0]+$/g, '');
  };
}
Now look at how the well-known core-js defines the whitespace characters for trim:
module.exports = '\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002' +
'\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF';
Let’s print it:
'\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002' +
'\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF'
// '\t\n\v\f\r ' followed by the remaining whitespace characters, which are not visible
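Printing whitespace is not very informative, so a more useful check is to label each character and confirm that the native trim and \s agree with the list. The string below is copied from the core-js snippet above; the variable names are chosen for this example.

```javascript
const whitespaces =
  "\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002" +
  "\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF";

// Show each character as a U+XXXX label instead of printing it.
const labels = Array.from(whitespaces, (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(labels.join(" "));

console.log(whitespaces.length);                                // 25
console.log((whitespaces + "x" + whitespaces).trim() === "x"); // true
console.log(/^\s+$/.test(whitespaces));                         // true
```

In an engine that follows the current specification, all 25 characters are removed by trim and matched by \s.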
CSS content property and CSS icon fonts
Modern browsers already support Chinese characters here, but using the hexadecimal escape is recommended. The format is \ + hexadecimal, with neither u nor {}. Characters with code points greater than 0xFFFF are supported.
div.me::before {
  content: "\6211"; /* 我 */
  padding-right: 10px;
}
div.me::before {
  content: "\20020"; /* 𠀠 */
  padding-right: 10px;
}
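The CSS escape is simply the hexadecimal code point, so it can be generated with the same toString(16) call used earlier. Below is a minimal sketch; `toCssEscape` is a hypothetical helper. It appends a space after each escape, which CSS treats as the end of the escape, so that a following character that happens to be a hex digit is not absorbed into it.

```javascript
// Hypothetical helper: turn each character into a CSS escape (\ + hex code point).
function toCssEscape(str) {
  return Array.from(str, (ch) => "\\" + ch.codePointAt(0).toString(16) + " ").join("");
}

console.log(JSON.stringify(toCssEscape("我")));  // "\\6211 "
console.log(JSON.stringify(toCssEscape("𠀠"))); // "\\20020 "
```

JSON.stringify is used only to make the trailing space and the single backslash visible in the console.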
CSS color values
Font colors, background colors, border colors, and so on can all be written as six hexadecimal digits, and there is also a shorthand form.
.title {
color: #FFF
}
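The shorthand works by doubling each digit, so #FFF means #FFFFFF. The following sketch parses both forms into numeric channels; `parseHexColor` is an illustrative helper and deliberately ignores the 4 and 8 digit forms that carry an alpha channel.

```javascript
// Hypothetical helper: parse #RGB or #RRGGBB into [red, green, blue].
function parseHexColor(color) {
  const match = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i.exec(color);
  if (!match) throw new Error("unsupported color: " + color);
  let hex = match[1];
  if (hex.length === 3) {
    hex = hex.replace(/./g, (digit) => digit + digit); // F -> FF
  }
  return [0, 2, 4].map((i) => parseInt(hex.slice(i, i + 2), 16));
}

console.log(parseHexColor("#FFF"));    // [ 255, 255, 255 ]
console.log(parseHexColor("#1e90ff")); // [ 30, 144, 255 ]
```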
Character counting
This takes advantage of a UTF-16 feature: \uD800-\uDFFF is the surrogate range. The details of UTF-16 encoding are explained separately; the article on why the result must be �� also has a full explanation.
const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
function chCounts(str){
return str.replace(spRegexp, '_').length;
}
chCounts("𠀠") // 1
"𠀠".length // 2
ES6 offers a few more options:
Array.from("𠀠").length // 1
[..."𠀠"].length // 1
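The same idea matters when cutting strings, not only when counting them: slicing by index can split a surrogate pair in half. Here is a small sketch of a code point aware truncation; `truncateByCodePoints` is a name invented for this example.

```javascript
// Hypothetical helper: keep at most `max` code points without cutting a surrogate pair.
function truncateByCodePoints(str, max) {
  return Array.from(str).slice(0, max).join("");
}

const text = "a𠀠b";
console.log(text.slice(0, 2) === "a\uD840");          // true: a lone high surrogate is left behind
console.log(truncateByCodePoints(text, 2) === "a𠀠"); // true
console.log(truncateByCodePoints(text, 2).length);    // 3 code units, 2 characters
```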
File type identification
How do we check the real type of a file? By comparing its leading bytes with a known signature. (check and readBuffer are helper functions that are not shown here.)
const isPNG = check([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]); // magic number of a PNG image
const realFileElement = document.querySelector("#realFileType");
async function handleChange(event) {
  const file = event.target.files[0];
  const buffers = await readBuffer(file, 0, 8); // read the first 8 bytes
  const uint8Array = new Uint8Array(buffers);
  realFileElement.innerText = `${file.name} file type: ${
    isPNG(uint8Array) ? "image/png" : file.type
  }`;
}
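The snippet above relies on two helpers, check and readBuffer, whose definitions are not included in this article. What follows is only a guess at their shape, written as a minimal sketch: check is assumed to return a predicate over a Uint8Array, and readBuffer is assumed to read a byte range through Blob.prototype.slice and arrayBuffer, which requires a reasonably recent browser.

```javascript
// Assumed shape of check: compare the leading bytes with a signature.
function check(signature) {
  return (bytes) => signature.every((byte, index) => bytes[index] === byte);
}

// Assumed shape of readBuffer: read bytes [start, end) of a File or Blob.
function readBuffer(file, start = 0, end = 2) {
  return file.slice(start, end).arrayBuffer(); // resolves to an ArrayBuffer
}

const isPNG = check([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0x00]);
const jpegHeader = new Uint8Array([0xff, 0xd8, 0xff, 0xe0]);

console.log(isPNG(pngHeader));  // true
console.log(isPNG(jpegHeader)); // false
console.log(String.fromCharCode(0x50, 0x4e, 0x47)); // PNG
```

The last line ties this back to the topic of the article: bytes 2 to 4 of the signature are simply the ASCII code points of the letters P, N, and G.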
Other
- Emoji
- Encoding conversion, such as UTF-8 to Base64
- and so on
Closing words
Stay true to the original intention: gain something without being worn down by it. If you find this article useful, your likes and comments are my biggest motivation to keep going.
References
ASCII Table ASCII Table Unicode string extension why “𠮷𠮷𠮷”.length! (ASCII, Unicode, UTF-8, UTF-16, UTF-32)