Basics
Unicode is a character set that assigns a number, called a code point, to almost every character in the world's writing systems. UTF-8 defines one way to store those code points as bytes. UTF-16 and UTF-32 are other encodings of Unicode, and GB2312 and GBK play the same storage role for their own character sets.
UTF-8: A variable-length character encoding (1 to 4 bytes) for Unicode. Its single-byte sequences are identical to ASCII.
UTF-16: Like UTF-8, a variable-length character encoding; each character takes 2 or 4 bytes.
UTF-32: Uses a fixed 4 bytes (32 bits) to store each code point directly, with no transformation. The disadvantage is that it takes up more space.
UCS-2: UTF-16 is a superset of UCS-2. Before the supplementary planes existed (the characters that UTF-16 encodes with a second pair of bytes), UTF-16 and UCS-2 referred to the same thing: one character takes up two bytes. JavaScript strings follow the UCS-2 model of 16-bit units, so special attention should be paid to characters with code points greater than U+FFFF.
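To make the size differences concrete, here is a minimal C++ sketch that reports how many bytes a code point needs in each encoding. The function names `utf8Length`, `utf16Length` and `utf32Length` are made up for this illustration, and the code assumes the caller passes a valid code point (at most 0x10FFFF).

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store a code point in UTF-8.
// The thresholds are the first values that no longer fit in 7, 11 and 16 payload bits.
int utf8Length(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // assumption: the caller passes a valid code point
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Bytes needed in UTF-16: 2 below U+10000, otherwise 4.
int utf16Length(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}

// UTF-32 always uses 4 bytes.
int utf32Length(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return 4;
}
```

For example, the character with code point 19968 takes 3 bytes in UTF-8 but only 2 in UTF-16, while code point 134071 takes 4 bytes in all three encodings.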
UTF-8 encoding (RFC 3629)
UTF-8 is a variable-length encoding that occupies a minimum of 1 byte and a maximum of 4 bytes. The encoding rules (RFC 3629, section 3) are as follows:
Single byte: occupies 1 byte (8 bits). The first bit is 0 and the remaining 7 bits hold the code point, up to 0b01111111 (127).
Multi-byte: occupies 2 to 4 bytes (16 to 32 bits). Encoding rules:
- First byte: the number of consecutive 1 bits at the start of the byte indicates how many bytes the character occupies, so there are 2 to 4 of them. They are followed by a single 0, and the remaining bits hold the high bits of the code point.
- Subsequent bytes (2nd to 4th): the count of leading 1 bits in the first byte determines whether the character has 1, 2, or 3 subsequent bytes (2, 3, or 4 bytes in total). Each subsequent byte starts with the two bits 10, and its remaining 6 bits hold further bits of the code point.
The complete code point is the significant bits of the first byte followed by the significant bits of the subsequent bytes.
Examples (the bits that carry the code point are in bold):
Character | Code point (decimal) | Number of bytes | UTF-8 encoding |
---|---|---|---|
‘a’ | 97 | 1 | 0**1100001** |
‘Ɛ’ | 400 | 2 | 110**00110** 10**010000** |
‘一’ | 19968 | 3 | 1110**0100** 10**111000** 10**000000** |
‘𠮷’ | 134071 | 4 | 11110**000** 10**100000** 10**101110** 10**110111** |
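The rules and the table above can be reproduced with a small encoder. This is only a sketch: `encodeUtf8` is a name chosen for this example, and it assumes the input is a valid code point (at most 0x10FFFF and not a surrogate), which it does not fully check.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one code point as UTF-8, following the bit layout described above.
std::string encodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // assumption: the caller passes a valid code point
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}
```

Calling `encodeUtf8(19968)` yields the three bytes 0xE4 0xB8 0x80, which are the binary values 11100100 10111000 10000000 from the table.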
Getting the code point of a UTF-8 encoded character in C++
```cpp
#include <string>

/**
 * Get the code point of the UTF-8 encoded character that starts at `index`.
 * Assumes `str` holds valid UTF-8 and `index` is the first byte of a character.
 * @param str - string
 * @param index - string index
 */
unsigned int getCodePointAt(const std::string &str, int index) {
    unsigned int code = (unsigned char) str[index];
    unsigned int baseBits = 0b10000000;
    unsigned int lastBits = 0;
    // Count the leading 1 bits of the first byte
    int bytes = 0;
    if ((code & baseBits) == baseBits) {  // 0b1xxxxxxx
        do {
            lastBits = baseBits;
            baseBits = baseBits | (baseBits >> 1);
            bytes++;
            // Stop at 4: longer sequences are not valid UTF-8, and without
            // this guard a 0xFF byte would loop forever
        } while (bytes < 4 && (code & baseBits) == baseBits);
    } else {
        return code;  // single byte (ASCII)
    }
    // 3-byte example: 1110xxxx 10xxxxxx 10xxxxxx
    // The first byte carries 8 - bytes - 1 code point bits; each subsequent byte carries 6.
    // Shift the first byte's bits left by 6 * (bytes - 1) to leave room for the lower bits.
    int i = bytes - 1;
    unsigned int result = (code ^ lastBits) << (i * 6);
    while (i >= 1) {
        unsigned int c = (unsigned char) str[index + i];
        result += (c & 0b00111111) << ((bytes - i - 1) * 6);
        i--;
    }
    return result;
}
```
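Because the first byte of every character announces how many bytes follow, a string can be walked one character at a time without decoding it. The sketch below counts code points that way. `sequenceLengthAt` and `countCodePoints` are hypothetical helper names, and continuation bytes are deliberately not validated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence starting at str[pos], read from the
// leading 1 bits of the first byte. Returns 0 if that byte cannot start a
// sequence (a continuation byte, or more than four leading 1 bits).
std::size_t sequenceLengthAt(const std::string &str, std::size_t pos) {
    assert(pos < str.size());  // precondition: pos is a valid index
    unsigned char lead = static_cast<unsigned char>(str[pos]);
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Count code points by hopping from first byte to first byte.
// Returns -1 on an invalid first byte or a sequence that runs past the end.
long countCodePoints(const std::string &str) {
    long count = 0;
    std::size_t pos = 0;
    while (pos < str.size()) {
        std::size_t len = sequenceLengthAt(str, pos);
        if (len == 0 || pos + len > str.size()) return -1;
        pos += len;
        ++count;
    }
    return count;
}
```

A 10-byte string holding the four example characters ‘a’, ‘Ɛ’, ‘一’ and ‘𠮷’ is reported as 4 code points.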
UTF-16 specification details (RFC 2781)
UTF-16 encodes each character in two or four bytes. The encoding and decoding rules below are paraphrased from the specification:
Encoding rules (RFC 2781, section 2.1):
Let U be the code point of the character; it must not be greater than 0x10FFFF.
- If U < 0x10000, encode U as a single 16-bit unsigned integer and stop.
- Otherwise, let U’ = U - 0x10000. Because U is less than or equal to 0x10FFFF, U’ is less than or equal to 0xFFFFF, so U’ fits in 20 bits (note: 0xFFFFF == 0b11111111111111111111, which is twenty binary 1s).
- Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00 respectively. Each has 10 free bits for encoding the character, 20 bits in total (note: the low 10 bits of both 0xD800 and 0xDC00 are 0, and that is the space being used).
- Assign the high 10 bits of U’ to the low 10 bits of W1 and the low 10 bits of U’ to the low 10 bits of W2. Stop.
Steps 2 through 4 look like this:
U’ = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
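The encoding steps translate almost line for line into code. A minimal sketch, assuming a valid code point as input; `encodeUtf16` is a name invented for this example and returns the 16-bit integers in order (W1, then W2 when needed).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one code point as one or two 16-bit integers, following the steps above.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // U must not be greater than 0x10FFFF
    if (cp < 0x10000) {
        // U < 0x10000: a single 16-bit integer
        return {static_cast<std::uint16_t>(cp)};
    }
    std::uint32_t u = cp - 0x10000;  // U' fits in 20 bits
    // W1 = 110110yyyyyyyyyy: 0xD800 plus the high 10 bits of U'
    std::uint16_t w1 = static_cast<std::uint16_t>(0xD800 | (u >> 10));
    // W2 = 110111xxxxxxxxxx: 0xDC00 plus the low 10 bits of U'
    std::uint16_t w2 = static_cast<std::uint16_t>(0xDC00 | (u & 0x3FF));
    return {w1, w2};
}
```

For code point 134071 (0x20BB7) this produces 0xD842 followed by 0xDFB7.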
Decoding rules (RFC 2781, section 2.2):
Let W1 be the next 16-bit integer in the sequence of integers that represents the text (informally, the string is an array of 16-bit integers), and let W2 be the 16-bit integer immediately following W1 (when it is used, it is the last integer of the current character).
- If W1 < 0xD800 or W1 > 0xDFFF, the value of W1 is the code point of the character. Stop.
- Check that W1 is between 0xD800 and 0xDBFF. If it is not, the sequence is in error and no valid character can be obtained. Stop.
- If there is no W2 (the sequence ends at W1) or W2 is not between 0xDC00 and 0xDFFF, the sequence is in error. Stop.
- Construct a 20-bit unsigned integer U’ whose high 10 bits are the low 10 bits of W1 and whose low 10 bits are the low 10 bits of W2.
- Add 0x10000 to U’ to obtain U, the code point of the character. Stop.
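The decoding steps can be sketched the same way. `decodeUtf16At` and the `kInvalid` sentinel are hypothetical names for this illustration; how errors are reported is a choice of this sketch, not something the specification prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel returned when the sequence is in error (an assumption of this sketch).
const std::uint32_t kInvalid = 0xFFFFFFFF;

// Decode the character that starts at units[pos], following the steps above.
std::uint32_t decodeUtf16At(const std::vector<std::uint16_t> &units, std::size_t pos) {
    assert(pos < units.size());  // precondition: pos is a valid index
    std::uint16_t w1 = units[pos];
    if (w1 < 0xD800 || w1 > 0xDFFF) return w1;        // W1 alone is the code point
    if (w1 > 0xDBFF) return kInvalid;                  // W1 is not in 0xD800..0xDBFF
    if (pos + 1 >= units.size()) return kInvalid;      // there is no W2
    std::uint16_t w2 = units[pos + 1];
    if (w2 < 0xDC00 || w2 > 0xDFFF) return kInvalid;   // W2 is not in 0xDC00..0xDFFF
    // U': low 10 bits of W1 become the high half, low 10 bits of W2 the low half
    std::uint32_t u = (static_cast<std::uint32_t>(w1 & 0x3FF) << 10) | (w2 & 0x3FF);
    return u + 0x10000;                                // U = U' + 0x10000
}
```

Given the pair 0xD842, 0xDFB7 at position 0, the function returns 134071; given 0xDFB7 on its own, it returns `kInvalid`.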
Summary
UTF-16 encodes a character in either two bytes or four bytes:
- Two-byte range: 0x0000 to 0xD7FF and 0xE000 to 0xFFFF
- Four-byte range: the first two bytes are in 0xD800 to 0xDBFF and the last two bytes are in 0xDC00 to 0xDFFF
Examples
Character | Code point (decimal) | Number of bytes | UTF-16 encoding |
---|---|---|---|
‘a’ | 97 | 2 | 00000000 01100001 |
‘Ɛ’ | 400 | 2 | 00000001 10010000 |
‘一’ | 19968 | 2 | 01001110 00000000 |
‘𠮷’ | 134071 | 4 | 11011000 01000010 11011111 10110111 |
To verify the decoding rules, take the ‘𠮷’ character (UTF-16 encoding: 11011000 01000010 11011111 10110111):
- Read the first two bytes: the value is 55362 (0xD842), which lies between 0xD800 and 0xDBFF, so the character is represented by 4 bytes and the next two bytes must be read as well.
- Read the next two bytes: the value is 57271 (0xDFB7), which lies between 0xDC00 and 0xDFFF, so the character is valid.
- Take the low 10 bits of the first pair (0001000010) and the low 10 bits of the second pair (1110110111) to form a 20-bit integer, 00010000101110110111, which is 68535 in decimal. Adding 0x10000 gives the code point: 68535 + 0x10000 = 134071.
JavaScript and character encoding
Because JavaScript handles strings UCS-2 style, with the same two-byte units as UTF-16, a character whose code point is greater than or equal to 0x10000 occupies two units, and the following can be observed:
```js
'𠮷'.length // 2
```
This problem can be circumvented in the following ways:
```js
Array.from('𠮷').length // 1
// Replace every surrogate pair (a character with a code point of 0x10000 or above,
// which JS counts as 2) with a single character
'𠮷'.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, '_').length // 1
```
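For comparison with the earlier C++ decoder, the counting that these workarounds perform can be sketched in C++ over a UTF-16 string. `countCodePointsUtf16` is a name invented for this example. A high surrogate followed by a low surrogate counts once, and an unpaired surrogate counts as one on its own, which is how `Array.from` splits a string.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string: a valid surrogate pair counts once,
// every other 16-bit unit (including an unpaired surrogate) counts as one.
std::size_t countCodePointsUtf16(const std::u16string &s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        bool nextLow = i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && nextLow) ++i;  // skip the second half of the pair
        ++count;
    }
    return count;
}
```

With the literal `u"\U00020BB7"` (the ‘𠮷’ character), `size()` reports 2 units while `countCodePointsUtf16` reports 1.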
Get the code point of a character:
```js
// charCodeAt follows the UCS-2 rule: it reads only the first 16-bit unit,
// so the value returned here is not the real code point
'𠮷'.charCodeAt(0) // 55362
// Use the string API added in the ECMAScript 2015 specification
'𠮷'.codePointAt(0) // 134071
```
References
- UTF-8 and UTF-16
- Unicode and JavaScript in detail
- A brief introduction to UTF-8 encoding
- RFC 3629
- RFC 2781