Unicode
Unicode is an industry standard in computing that covers a character set, encoding schemes, and more. It was created to overcome the limitations of traditional character encodings: it assigns a single, unique code to every character of every language, so that text can be converted and processed across languages and platforms. Version 1.0 of the standard was published in 1991.
Code point range: 0x0000-0x10FFFF.
The range is divided into 17 planes:
Plane 0 (0x0000-0xFFFF): the Basic Multilingual Plane, containing the English alphabet, Arabic numerals, and the characters of most languages in common use today.
Plane 1 (0x10000-0x1FFFF): the Supplementary Multilingual Plane, containing additional scripts and emoji.
Plane 2 (0x20000-0x2FFFF): the Supplementary Ideographic Plane, devoted entirely to CJK (Chinese, Japanese, and Korean) ideographs.
Plane 14 (0xE0000-0xEFFFF): the Supplementary Special-purpose Plane, which currently has very few assigned characters.
Planes 15 and 16 (0xF0000-0x10FFFF): private use areas, that is, planes for custom encoding.
The other planes are largely unassigned. This plane division is readily available on Wikipedia; see the link in the references.
Within plane 0, the private use area and the surrogate area have no assigned characters:
Private use area: 0xE000-0xF8FF, 6400 code points
The private use area is reserved for custom encoding.
Surrogate area: 0xD800-0xDFFF, 2048 code points
The surrogate area is reserved for the UTF-16 encoding scheme, as described below.
Emoji: 1F600-1F64F, 2600-26FF, …
Chinese characters: 4E00-9FFF; Extension A: 3400-4DBF; Extensions B-F: 20000-2EBEF
For example, the Chinese character '润' is 0x6DA6 and the emoji '😆' is 0x1F606.
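As a rough illustration of the plane layout above, the plane of a code point is simply its value divided by 0x10000. The sketch below assumes a hypothetical class name `UnicodePlanes` and hypothetical helper names; only `Character.isValidCodePoint` comes from the JDK.

```java
class UnicodePlanes {
    // Each plane holds 0x10000 code points, so the plane index is the
    // code point shifted right by 16 bits.
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    // Surrogate area of plane 0 (reserved for UTF-16).
    static boolean isBmpSurrogate(int codePoint) {
        return codePoint >= 0xD800 && codePoint <= 0xDFFF;
    }

    // Private use area of plane 0.
    static boolean isBmpPrivateUse(int codePoint) {
        return codePoint >= 0xE000 && codePoint <= 0xF8FF;
    }

    public static void main(String[] args) {
        int[] samples = {0x6DA6, 0x1F606, 0x20000, 0xF0000, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X is in plane %d%n", cp, plane(cp));
        }
    }
}
```

With the two example characters, 0x6DA6 lands in plane 0 and 0x1F606 in plane 1.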
Unicode encoding implementations
UTF stands for Unicode Transformation Format, a format for transforming the Unicode character set into bytes. There are three implementations of Unicode: UTF-8, UTF-16, and UTF-32.
1. UTF-8
UTF-8 uses 8 bits as its code unit. A character takes at least one unit (one byte) and at most four units (four bytes); how many depends on the character's code point in the Unicode character set, as shown in the table below.
Unicode code point (hexadecimal) | UTF-8 byte stream (binary) |
---|---|
000000-00007F | 0xxxxxxx |
000080-0007FF | 110xxxxx 10xxxxxx |
000800-00FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
010000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
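The table can be turned into code fairly directly. The following is a minimal sketch, not a production encoder; the class name `Utf8Sketch` and the method `encode` are made up for illustration, and the `main` method compares the result with the JDK's own UTF-8 encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encode one code point by filling the x bits of the matching table row.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x41, 0x6DA6, 0x1F606}) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X matches JDK: %b%n", cp, Arrays.equals(mine, jdk));
        }
    }
}
```

For the two example characters, 0x6DA6 falls in the three-byte row and 0x1F606 in the four-byte row.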
2. UTF-16
UTF-16 uses 16 bits as its code unit. A character takes either one unit (two bytes) or two units (four bytes): code points below 0x10000 take one unit, and code points 0x10000-0x10FFFF take two.
a. U < 0x10000: the UTF-16 code unit is the code point itself, as a 16-bit unsigned integer.
b. U >= 0x10000:
A three-step mnemonic: subtract, convert, fill the template.
1). Subtract: U' = U - 0x10000
Example: 0x1F606 - 0x10000 = 0xF606
2). Convert: write U' as a 20-bit binary number, yyyy yyyy yyxx xxxx xxxx
3). Template: 1101 10yy yyyy yyyy 1101 11xx xxxx xxxx
Working out the ranges: the first unit, 1101 10yy yyyy yyyy, runs from 1101 1000 0000 0000 to 1101 1011 1111 1111, i.e. D800-DBFF. The second unit, 1101 11xx xxxx xxxx, runs from 1101 1100 0000 0000 to 1101 1111 1111 1111, i.e. DC00-DFFF. Therefore D800-DFFF is the area Unicode sets aside as surrogates for UTF-16, where D800-DBFF is the high surrogate area and DC00-DFFF is the low surrogate area. The surrogate area has 2048 code points; Unicode keeps it reserved, with no characters assigned, purely for use as surrogates.
As noted above, planes 15 and 16 (0xF0000-0x10FFFF) are private use areas. We can calculate their UTF-16 encoding with the same formula. 0xF0000 - 0x10000 = 0xE0000 = 1110 0000 0000 0000 0000 in binary; filling the template gives 1101 1011 1000 0000 (DB80) 1101 1100 0000 0000 (DC00). 0x10FFFF - 0x10000 = 0xFFFFF; filling the template gives 1101 1011 1111 1111 (DBFF) 1101 1111 1111 1111 (DFFF). Because the high surrogates that UTF-16 uses for planes 15 and 16 fall in DB80-DBFF, this sub-range of the high surrogate area D800-DBFF is also called the high private use surrogate area.
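The subtract, convert, and template steps can be sketched in a few lines of Java. The class name `Utf16EncodeSketch` and the method `encode` are hypothetical; the `main` method checks the result against the JDK's `Character.toChars`.

```java
import java.util.Arrays;

class Utf16EncodeSketch {
    // Encode one code point U into one or two 16-bit code units.
    static char[] encode(int u) {
        if (u < 0 || u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + u);
        }
        if (u < 0x10000) {
            return new char[] {(char) u};      // case a: the unit is U itself
        }
        int uPrime = u - 0x10000;              // 1) subtract
        int y = uPrime >>> 10;                 // 2) convert: high 10 bits
        int x = uPrime & 0x3FF;                //    and low 10 bits
        return new char[] {                    // 3) fill the template
            (char) (0xD800 | y),               //    1101 10yy yyyy yyyy
            (char) (0xDC00 | x)};              //    1101 11xx xxxx xxxx
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x6DA6, 0x1F606, 0xF0000, 0x10FFFF}) {
            char[] units = encode(cp);
            StringBuilder sb = new StringBuilder();
            for (char c : units) {
                sb.append(String.format("%04X ", (int) c));
            }
            System.out.printf("U+%04X -> %s(matches JDK: %b)%n",
                cp, sb, Arrays.equals(units, Character.toChars(cp)));
        }
    }
}
```

For 0x1F606 this yields the pair D83D DE06, and the two ends of the private use planes yield DB80 DC00 and DBFF DFFF, as calculated above.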
Conversely, to decode UTF-16 back to a Unicode code point:
a. Check: if the code unit is outside the surrogate area, it is the code point itself (U < 0x10000).
b. If the code unit is in the surrogate area:
1). Template: 1101 10yy yyyy yyyy 1101 11xx xxxx xxxx
2). Convert: read yyyy yyyy yyxx xxxx xxxx as the 20-bit number U'
3). Add: U = U' + 0x10000
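The decoding steps can be sketched the same way. This is an illustrative sketch with a hypothetical class `Utf16DecodeSketch`; it decodes only the first code point of the given units and rejects malformed pairs.

```java
class Utf16DecodeSketch {
    // Decode the first code point from a sequence of UTF-16 code units.
    static int decodeFirst(char[] units) {
        char first = units[0];
        if (first < 0xD800 || first > 0xDFFF) {
            return first;                       // case a: not a surrogate
        }
        // case b: must be a high surrogate followed by a low surrogate
        if (first > 0xDBFF || units.length < 2
                || units[1] < 0xDC00 || units[1] > 0xDFFF) {
            throw new IllegalArgumentException("malformed surrogate pair");
        }
        int y = first & 0x3FF;                  // 1) strip the 1101 10 prefix
        int x = units[1] & 0x3FF;               //    strip the 1101 11 prefix
        int uPrime = (y << 10) | x;             // 2) rebuild the 20-bit U'
        return uPrime + 0x10000;                // 3) add
    }

    public static void main(String[] args) {
        int cp = decodeFirst(new char[] {0xD83D, 0xDE06});
        System.out.printf("U+%04X (matches JDK: %b)%n",
            cp, cp == Character.toCodePoint((char) 0xD83D, (char) 0xDE06));
    }
}
```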
3. UTF-32
UTF-32 uses a 32-bit (4-byte) code unit. Every Unicode code point can be represented directly, with far more room than is needed. The implementation is simple, but it is rarely used: two bytes are enough for most characters, so four bytes per character wastes too much space.
4. Comparison of UTF-8 and UTF-16
1000 Chinese characters versus 1000 English letters
Create a text file and enter 1000 common Chinese characters (4E00-9FFF) at random; the file is about 3KB. Windows Notepad saves as UTF-8 by default, and a Chinese character takes three bytes in UTF-8, so 1000 characters come to about 3KB. Save the same text as UTF-16 and the size drops to about 2KB, because UTF-16 needs only 2 bytes per Chinese character. If you replace the 1000 Chinese characters with 1000 English letters or Arabic numerals, the UTF-8 file is only about 1KB, while the UTF-16 file is still about 2KB. This is because UTF-8 represents an English letter or an Arabic numeral in one unit (1 byte), whereas UTF-16 always needs at least one unit (2 bytes). This comparison gives an intuitive feel for the difference between UTF-8 and UTF-16.
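The same comparison can be reproduced without Notepad by counting encoded bytes in memory. This is a small sketch with a hypothetical class `EncodedSizeSketch`; it uses UTF-16LE so that no byte order mark is counted, and it repeats a single character rather than 1000 different ones, which does not change the sizes.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeSketch {
    // Build a string of n copies of c (kept simple for older Java versions).
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String chinese = repeat('\u6DA6', 1000);   // 1000 Chinese characters
        String english = repeat('a', 1000);        // 1000 English letters
        System.out.println("Chinese UTF-8:  " + utf8Size(chinese));
        System.out.println("Chinese UTF-16: " + utf16Size(chinese));
        System.out.println("English UTF-8:  " + utf8Size(english));
        System.out.println("English UTF-16: " + utf16Size(english));
    }
}
```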
Because most of the characters in common use worldwide have code points in the 0x100-0xFFFF range, UTF-16 is the best option for such text. The design of Java's char is a good illustration: a char is 2 bytes, and Java's native character encoding is UTF-16. Of course, for a program that handles only English letters and Arabic numerals, choosing UTF-8 is also reasonable.
5. Byte order
When saving the text document as UTF-16, I found that the exact format is either UTF-16LE or UTF-16BE.
BE (Big Endian) or LE (Little Endian)
Note: this is byte order, not bit order, and it is the order of the bytes within one code unit, not of all the bytes. Look closely at the UTF-16 and UTF-32 encodings of 0x6DA6 and 0x1F606 to see what this means.
Unicode | UTF-16LE | UTF-16BE | UTF-32LE | UTF-32BE |
---|---|---|---|---|
0x6DA6 | A6 6D | 6D A6 | A6 6D 00 00 | 00 00 6D A6 |
0x1F606 | 3D D8 06 DE | D8 3D DE 06 | 06 F6 01 00 | 00 01 F6 06 |
UTF-8 has one byte per unit, so there is no big-endian or little-endian distinction.
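The rows of the byte order table can be reproduced with a short sketch. The class `ByteOrderSketch` and its helpers are hypothetical names; UTF-16 bytes come from the JDK's standard charsets, while the UTF-32 bytes are produced by writing the code point as a 4-byte integer with `ByteBuffer`, on the assumption that this is all UTF-32 does for a single code point.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class ByteOrderSketch {
    // Format bytes as lowercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    static byte[] utf16(int cp, boolean bigEndian) {
        String s = new String(Character.toChars(cp));
        return s.getBytes(bigEndian ? StandardCharsets.UTF_16BE : StandardCharsets.UTF_16LE);
    }

    // One 32-bit unit per code point; only the byte order inside it differs.
    static byte[] utf32(int cp, boolean bigEndian) {
        return ByteBuffer.allocate(4)
            .order(bigEndian ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN)
            .putInt(cp)
            .array();
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x6DA6, 0x1F606}) {
            System.out.printf("U+%04X | %s | %s | %s | %s%n", cp,
                hex(utf16(cp, false)), hex(utf16(cp, true)),
                hex(utf32(cp, false)), hex(utf32(cp, true)));
        }
    }
}
```

Note how, for 0x1F606 in UTF-16LE, the bytes are swapped inside each of the two units, but the high surrogate unit still comes first.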
UTF encoding | Byte Order Mark (BOM) |
---|---|
UTF-8 without BOM | None |
UTF-8 with BOM | EF BB BF |
UTF-16LE | FF FE |
UTF-16BE | FE FF |
UTF-32LE | FF FE 00 00 |
UTF-32BE | 00 00 FE FF |
The BOM is the code point U+FEFF, ZWNBSP: ZERO WIDTH NO-BREAK SPACE.
Microsoft prefixes its UTF-8 text files with EF BB BF.
Programs like Notepad on Windows use these three bytes to determine whether a text file is ASCII or UTF-8.
However, this is only a marker that Microsoft adds on its own; UTF-8 text files on other platforms do not necessarily carry it.
Some Microsoft software performs this detection, but other software does not and treats the bytes as an ordinary character. (This is the source of the notorious garbled-text problem.)
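A reader that wants to honor the BOM can compare the first bytes of a file with the table above. The sketch below uses a hypothetical class `BomSketch`; it returns a label rather than a `Charset`, and it checks the 4-byte marks first because FF FE 00 00 begins with the UTF-16LE mark FF FE.

```java
class BomSketch {
    // Return the encoding suggested by a leading BOM, or "none".
    static String detect(byte[] data) {
        // Longest marks first: FF FE 00 00 would otherwise match UTF-16LE.
        // (UTF-16LE text starting with U+0000 is misread here; a real
        // detector would need more context.)
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return "none";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        byte[] withoutBom = {0x61, 0x62};
        System.out.println(detect(withBom));
        System.out.println(detect(withoutBom));
    }
}
```

A file with no BOM yields "none", which is exactly the case where software has to guess and garbled text can appear.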
Check whether a string contains emoji
1. Requirements
If the input contains spaces, emoticons, emoji, or more than 10 characters, no search suggestions are triggered; otherwise, suggestions are triggered whenever the number of characters changes.
Requirement breakdown: detect whether the input contains emoji.
2. Implementation
public static boolean containsEmoji(String str) {
    int len = str.length();
    // Advance by whole code points so a surrogate pair is examined only once.
    for (int i = 0; i < len; ) {
        int codePoint = Character.codePointAt(str, i);
        if (isEmojiCharacterByWiki(codePoint)) {
            return true;
        }
        i += Character.charCount(codePoint);
    }
    return false;
}
/**
 * Whether the code point falls in one of the symbol blocks treated as emoji.
 * Superscripts and Subscripts (2070-209F)
 * Currency Symbols (20A0-20CF)
 * Combining Diacritical Marks for Symbols (20D0-20FF)
 * Letterlike Symbols (2100-214F), e.g. ℃, ™
 * Number Forms (2150-218F), e.g. ⅓
 * Arrows (2190-21FF), e.g. →
 * Mathematical Operators (2200-22FF)
 * Miscellaneous Technical (2300-23FF)
 * Control Pictures (2400-243F)
 * Optical Character Recognition (2440-245F)
 * Enclosed Alphanumerics (2460-24FF)
 * Box Drawing (2500-257F)
 * Block Elements (2580-259F)
 * Geometric Shapes (25A0-25FF)
 * Miscellaneous Symbols (2600-26FF)
 * Dingbats (2700-27BF)
 * Miscellaneous Mathematical Symbols-A (27C0-27EF)
 * Supplemental Arrows-A (27F0-27FF)
 * Braille Patterns (2800-28FF)
 * Supplemental Arrows-B (2900-297F)
 * Miscellaneous Mathematical Symbols-B (2980-29FF)
 * Supplemental Mathematical Operators (2A00-2AFF)
 * Miscellaneous Symbols and Arrows (2B00-2BFF)
 * <p>
 * CJK Symbols and Punctuation (3000-303F); the check covers 3000-30FF, which also takes in Hiragana and Katakana
 * <p>
 * Enclosed CJK Letters and Months (3200-32FF)
 * <p>
 * Mahjong Tiles (1F000-1F02F)
 * Domino Tiles (1F030-1F09F)
 * Playing Cards (1F0A0-1F0FF)
 * Enclosed Alphanumeric Supplement (1F100-1F1FF)
 * Enclosed Ideographic Supplement (1F200-1F2FF)
 * Miscellaneous Symbols and Pictographs (1F300-1F5FF)
 * Emoticons (1F600-1F64F)
 * Ornamental Dingbats (1F650-1F67F)
 * Transport and Map Symbols (1F680-1F6FF)
 * Alchemical Symbols (1F700-1F77F)
 * Geometric Shapes Extended (1F780-1F7FF)
 * Supplemental Arrows-C (1F800-1F8FF)
 * Supplemental Symbols and Pictographs (1F900-1F9FF)
 * Chess Symbols (1FA00-1FA6F)
 * https://en.wikibooks.org/wiki/Unicode/Character_reference
 *
 * @param codePoint the Unicode code point to test
 * @return true if the code point lies in one of the ranges above
 */
private static boolean isEmojiCharacterByWiki(int codePoint) {
    return ((codePoint >= 0x2070) && (codePoint <= 0x2BFF)) ||
            ((codePoint >= 0x3000) && (codePoint <= 0x30FF)) ||
            ((codePoint >= 0x3200) && (codePoint <= 0x32FF)) ||
            ((codePoint >= 0x1F000) && (codePoint <= 0x1FA6F));
}
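Since a Java char is one UTF-16 code unit, an emoji such as 😆 occupies two chars, which is why the check has to work on code points rather than chars. The sketch below shows one alternative way to walk code points, using `String.codePoints()` (Java 8+). The class `CodePointScanSketch` and its method names are hypothetical; the ranges are copied from the method above.

```java
class CodePointScanSketch {
    // Same four ranges as isEmojiCharacterByWiki above.
    static boolean inSymbolRanges(int cp) {
        return (cp >= 0x2070 && cp <= 0x2BFF)
            || (cp >= 0x3000 && cp <= 0x30FF)
            || (cp >= 0x3200 && cp <= 0x32FF)
            || (cp >= 0x1F000 && cp <= 0x1FA6F);
    }

    // codePoints() yields whole code points, combining surrogate pairs.
    static boolean containsSymbol(String s) {
        return s.codePoints().anyMatch(CodePointScanSketch::inSymbolRanges);
    }

    public static void main(String[] args) {
        String text = "ok" + new String(Character.toChars(0x1F606));
        System.out.println("chars:       " + text.length());
        System.out.println("code points: " + text.codePointCount(0, text.length()));
        System.out.println("has symbol:  " + containsSymbol(text));
    }
}
```

Here the string holds three characters as a user sees them but four chars, which matters for a requirement such as a 10-character limit.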
References:
1. "When the specification for the Java language was created, the Unicode standard was accepted and the char primitive was defined as a 16-bit data type, with characters in the hexadecimal range from 0x0000 to 0xFFFF."
https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html
2. "The native character encoding of the Java programming language is UTF-16."
https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html
3. Wikibooks' listing of Unicode character blocks
Zh.wikibooks.org/wiki/Unicod…
4. Emoji code tables
https://apps.timwhitlock.info/emoji/tables/unicode