Unicode encoding in Android

Unicode

Unicode is an industry standard in computing that covers a character set, encoding schemes, and more. It was created to overcome the limitations of traditional character encodings by assigning every character in every language a single, unique code, so that text can be converted and processed across languages and platforms. Work on it began in the late 1980s, and version 1.0 of the standard was published in 1991.

Encoding range: 0x0000-0x10FFFF.

The range is divided into 17 planes. Plane 0 (0x0000-0xFFFF) is the Basic Multilingual Plane, which contains the English alphabet, Arabic numerals, and most characters in common use today. Plane 1 (0x10000-0x1FFFF) is the Supplementary Multilingual Plane, which holds additions for various scripts as well as emoji. Plane 2 (0x20000-0x2FFFF) is the Supplementary Ideographic Plane, set aside entirely for CJK (Chinese, Japanese, and Korean) ideographs. Plane 14 (0xE0000-0xEFFFF) is the Supplementary Special-purpose Plane, which currently has very few assigned code points. Planes 15 and 16 (0xF0000-0x10FFFF) are private use areas, that is, planes for custom encodings. The other planes are largely unassigned. This plane division is readily available on Wikipedia, with a link below.

Within plane 0, the private use area and the surrogate area have no assigned characters:

Private use area: 0xE000-0xF8FF, 6400 code points

Private use areas are reserved for custom encodings.

Surrogate area: 0xD800-0xDFFF, 2048 code points

The surrogate area is reserved for the UTF-16 encoding scheme, as described below.

Emoji: 1F600-1F64F, 2600-26FF, …

Chinese characters: 4E00-9FFF, Extension A: 3400-4DBF, Extensions B-F: 20000-2EBEF

For example: the Chinese character '润' is 0x6DA6 and the emoji '😆' is 0x1F606.
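The plane of a code point is simply the value of the bits above its low 16 bits. The following is a minimal sketch of that idea; the class and method names (`CodePointPlanes`, `planeOf`) are made up for illustration and do not come from any library.

```java
final class CodePointPlanes {
    /** Returns the plane index (0-16) of a Unicode code point. */
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        // Each plane spans 0x10000 code points, so the plane is the value above the low 16 bits.
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int[] samples = {0x6DA6, 0x1F606, 0xF0000, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X is in plane %d%n", cp, planeOf(cp));
        }
    }
}
```

With the two examples above, 0x6DA6 lands in plane 0 and 0x1F606 in plane 1.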

Unicode encoding implementations

UTF stands for Unicode Transformation Format. There are three encoding forms for Unicode: UTF-8, UTF-16, and UTF-32.

1. UTF-8:

The unit is 8 bits. A character takes at least one unit (one byte) and at most four units (four bytes). The number of units depends on the character's code point in the Unicode character set, as shown in the table below.

Unicode code point (hexadecimal)	UTF-8 byte stream (binary)
000000-00007F	0xxxxxxx
000080-0007FF	110xxxxx 10xxxxxx
000800-00FFFF	1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
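The table can be turned into code almost row by row. The sketch below is a hand-written encoder for a single code point, intended only to illustrate the bit layout; `Utf8Sketch` and `encode` are illustrative names, and in real code `String.getBytes(StandardCharsets.UTF_8)` does this work. The `main` method compares the result with the JDK's own encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Sketch {
    /** Encodes one code point into 1-4 UTF-8 bytes, one branch per table row. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x6DA6, 0x1F606};
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```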

2. UTF-16:

The unit is 16 bits. A character is represented by either one unit (two bytes) or two units (four bytes). Code points below 0x10000 take one unit, and code points in 0x10000-0x10FFFF take two units.

a. U < 0x10000: the UTF-16 code unit is the code point itself, as a 16-bit unsigned integer.

b. U >= 0x10000:

A three-step mnemonic: first subtract, then convert, then apply the template.

1). Subtract: U' = U - 0x10000

Example: 0x1F606 - 0x10000 = 0xF606

2). Convert: write U' as a 20-bit binary number, yyyy yyyy yyxx xxxx xxxx

3). Template: 1101 10yy yyyy yyyy 1101 11xx xxxx xxxx

Working through the template, the high unit 1101 10yy yyyy yyyy ranges from 1101 1000 0000 0000 to 1101 1011 1111 1111, i.e. D800-DBFF, and the low unit 1101 11xx xxxx xxxx ranges from 1101 1100 0000 0000 to 1101 1111 1111 1111, i.e. DC00-DFFF. Therefore D800-DFFF is the surrogate area that Unicode sets aside for UTF-16, where D800-DBFF is the high surrogate area and DC00-DFFF is the low surrogate area. The surrogate area contains 2048 code points. Unicode keeps it reserved, with no characters assigned, so that it can be used only for surrogates.

As noted above, planes 15 and 16 (0xF0000-0x10FFFF) are private use areas, and the same formula gives their UTF-16 encoding. 0xF0000 - 0x10000 = 0xE0000 = 1110 0000 0000 0000 0000 in binary, which the template turns into 1101 1011 1000 0000 (DB80) 1101 1100 0000 0000 (DC00). 0x10FFFF - 0x10000 = 0xFFFFF, which the template turns into 1101 1011 1111 1111 (DBFF) 1101 1111 1111 1111 (DFFF). Because the high surrogates for planes 15 and 16 fall in DB80-DBFF, this sub-range of the high surrogate area D800-DBFF is also called the high private use surrogate area.
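The subtract, convert, template steps can be sketched in a few lines of Java. This is an illustration of the arithmetic only; `Utf16EncodeSketch` and `encode` are hypothetical names, and the JDK's `Character.toChars` performs the same conversion.

```java
final class Utf16EncodeSketch {
    /** Encodes one code point into one or two UTF-16 code units. */
    static char[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp < 0x10000) {
            return new char[] {(char) cp};      // case a: the unit is the code point itself
        }
        int u = cp - 0x10000;                   // 1) subtract
        int high = 0xD800 | (u >> 10);          // 2)+3) top 10 bits go into 1101 10yy yyyy yyyy
        int low = 0xDC00 | (u & 0x3FF);         //       low 10 bits go into 1101 11xx xxxx xxxx
        return new char[] {(char) high, (char) low};
    }

    public static void main(String[] args) {
        int[] samples = {0x6DA6, 0x1F606, 0xF0000, 0x10FFFF};
        for (int cp : samples) {
            StringBuilder units = new StringBuilder();
            for (char c : encode(cp)) {
                units.append(String.format("%04X ", (int) c));
            }
            System.out.printf("U+%04X -> %s%n", cp, units.toString().trim());
        }
    }
}
```

For 0xF0000 and 0x10FFFF this reproduces the DB80 DC00 and DBFF DFFF pairs computed above.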

Conversely, decoding UTF-16 back to a Unicode code point:

a. Check: if the code unit is outside the surrogate area, the character's code point U (< 0x10000) is the unit value itself.

b. If the code unit is in the surrogate area:

1). Template: 1101 10yy yyyy yyyy 1101 11xx xxxx xxxx

2). Convert: read yyyy yyyy yyxx xxxx xxxx as the 20-bit value U'

3). Add: U = U' + 0x10000
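The reverse direction can be sketched the same way. The names `Utf16DecodeSketch` and `decode` are illustrative only; `Character.toCodePoint` and `String.codePointAt` are the JDK equivalents, although they are more lenient about unpaired surrogates than this sketch.

```java
final class Utf16DecodeSketch {
    /** Decodes the code point that starts at units[index]. */
    static int decode(char[] units, int index) {
        char first = units[index];
        if (first < 0xD800 || first > 0xDFFF) {
            return first;                       // case a: outside the surrogate area
        }
        // case b: a high surrogate must be followed by a low surrogate
        if (first > 0xDBFF || index + 1 >= units.length) {
            throw new IllegalArgumentException("unpaired surrogate");
        }
        char second = units[index + 1];
        if (second < 0xDC00 || second > 0xDFFF) {
            throw new IllegalArgumentException("unpaired surrogate");
        }
        // 1)+2) strip the template prefixes and join the two 10-bit halves into U'
        int u = ((first & 0x3FF) << 10) | (second & 0x3FF);
        return u + 0x10000;                     // 3) add
    }

    public static void main(String[] args) {
        char[] grin = {0xD83D, 0xDE06};
        char[] run = {0x6DA6};
        System.out.printf("U+%04X%n", decode(grin, 0));
        System.out.printf("U+%04X%n", decode(run, 0));
    }
}
```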

3. UTF-32:

The unit is 32 bits, or 4 bytes. Every Unicode code point can be represented directly in one unit, with far more values available than are needed. The implementation is simple, but UTF-32 is rarely used because two bytes are enough for most characters and four bytes waste too much space.

4. Comparison of UTF-8 and UTF-16 schemes

Comparing 1000 Chinese characters with 1000 English letters

Create a text file and enter 1000 randomly chosen common Chinese characters (4E00-9FFF); the file is about 3 KB. Windows Notepad saves in UTF-8 by default, and a Chinese character takes three bytes in UTF-8, so 1000 characters come to about 3 KB. Save the same text as UTF-16 and the size drops to about 2 KB, because UTF-16 needs only 2 bytes for each of these characters. If you replace the 1000 Chinese characters with 1000 English letters or Arabic numerals, the file is only about 1 KB in UTF-8 but still about 2 KB in UTF-16. UTF-8 can express an English letter or a digit in one unit (1 byte), while UTF-16 needs at least one unit (2 bytes) for any character. This comparison gives an intuitive feel for the difference between UTF-8 and UTF-16.
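The same comparison can be reproduced in memory without a text editor. The sketch below makes two simplifying assumptions: it repeats one character 1000 times instead of picking random ones, and it uses `UTF_16LE` so that no byte order mark is added to the count. `EncodedSizeSketch` and `repeat` are illustrative names.

```java
import java.nio.charset.StandardCharsets;

final class EncodedSizeSketch {
    /** Builds a string made of `count` copies of one code point. */
    static String repeat(int codePoint, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String chinese = repeat(0x6DA6, 1000);  // a character from the 4E00-9FFF range
        String english = repeat('a', 1000);

        System.out.println("Chinese, UTF-8:  " + chinese.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("Chinese, UTF-16: " + chinese.getBytes(StandardCharsets.UTF_16LE).length);
        System.out.println("English, UTF-8:  " + english.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("English, UTF-16: " + english.getBytes(StandardCharsets.UTF_16LE).length);
    }
}
```

The byte counts come out as 3000 and 2000 for the Chinese text and 1000 and 2000 for the English text, matching the file sizes described above.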

Because most characters in common use worldwide have code points in the 0x100-0xFFFF range, UTF-16 is often the better fit for such text. The design of Java's char is a good illustration: a char is 2 bytes, and Java's native character encoding is UTF-16. Of course, for a program that handles only English letters and Arabic numerals, choosing UTF-8 is also understandable.

5. Byte order

When saving a text document as UTF-16, I found that the exact format is either UTF-16LE or UTF-16BE.

BE (big endian) or LE (little endian)

Note: this is byte order, not bit order, and it applies to the bytes within one unit, not to the order of the units themselves. Compare the UTF-16 and UTF-32 encodings of 0x6DA6 and 0x1F606 to see what this means.

Unicode	UTF-16LE	UTF-16BE	UTF-32LE	UTF-32BE
0x6DA6	A6 6D	6D A6	A6 6D 00 00	00 00 6D A6
0x1F606	3D D8 06 DE	D8 3D DE 06	06 F6 01 00	00 01 F6 06
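The rows of this table can be reproduced with a short program. In the sketch below, the UTF-16 bytes come from the JDK's `UTF_16LE` and `UTF_16BE` charsets, while the UTF-32 bytes are produced by writing the code point as a 4-byte integer with `ByteBuffer`, which is a shortcut rather than a real UTF-32 encoder. `ByteOrderSketch`, `hex`, and `utf32` are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class ByteOrderSketch {
    /** Formats bytes as space-separated uppercase hex, e.g. "3D D8". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Writes one code point as a single 32-bit unit in the given byte order. */
    static byte[] utf32(int codePoint, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(codePoint).array();
    }

    public static void main(String[] args) {
        int[] samples = {0x6DA6, 0x1F606};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X%n", cp);
            System.out.println("  UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
            System.out.println("  UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
            System.out.println("  UTF-32LE: " + hex(utf32(cp, ByteOrder.LITTLE_ENDIAN)));
            System.out.println("  UTF-32BE: " + hex(utf32(cp, ByteOrder.BIG_ENDIAN)));
        }
    }
}
```

In the UTF-16LE output for 0x1F606, the two units keep their order (high surrogate first) and only the bytes inside each unit are swapped.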

UTF-8 uses one byte per unit, so it has no big-endian or little-endian distinction.

UTF encoding	Byte Order Mark (BOM)
UTF-8 without BOM	None
UTF-8 with BOM	EF BB BF
UTF-16LE	FF FE
UTF-16BE	FE FF
UTF-32LE	FF FE 00 00
UTF-32BE	00 00 FE FF

FEFF is ZWNBSP: ZERO WIDTH NO-BREAK SPACE.

Microsoft prefixes its UTF-8 text files with EF BB BF.

Programs such as Notepad on Windows use these three bytes to tell whether a text file is ASCII or UTF-8.

However, this marker is only a Microsoft convention, and UTF-8 text files on other platforms do not necessarily carry it.

Some Microsoft software checks for it, but other software does not and treats it as an ordinary character, which is a well-known cause of garbled text.
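Java's built-in UTF-8 decoder is one of the programs that keeps the BOM as an ordinary character, so a file saved by Notepad decodes to a string that starts with U+FEFF. The following sketch shows one way to check for the three bytes and skip them before decoding; `BomSketch`, `hasUtf8Bom`, and `decodeUtf8` are hypothetical helper names, not library APIs.

```java
import java.nio.charset.StandardCharsets;

final class BomSketch {
    /** True if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Decodes UTF-8 bytes, dropping a leading BOM if one is present. */
    static String decodeUtf8(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String naive = new String(withBom, StandardCharsets.UTF_8);
        String cleaned = decodeUtf8(withBom);
        System.out.println("naive length:   " + naive.length());
        System.out.println("cleaned length: " + cleaned.length());
    }
}
```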

Check whether the string contains emoji

1. Requirements

If the input contains spaces, emoticons, emoji, or more than 10 characters, no search suggestions are triggered; otherwise, suggestions are triggered whenever the number of characters changes.

Requirement breakdown: determine whether the input contains emoji.

2. Implementation

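public static boolean containsEmoji(String str) {
    int len = str.length();
    // Walk the string by code point, not by char, so that a surrogate pair is read once.
    for (int i = 0; i < len; ) {
        int codePoint = Character.codePointAt(str, i);
        if (isEmojiCharacterByWiki(codePoint)) {
            return true;
        }
        // Advance by 2 chars for a supplementary code point, 1 otherwise.
        i += Character.charCount(codePoint);
    }
    return false;
}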


/**
 * Whether the code point lies in one of the symbol and emoji ranges below.
 *
 * Blocks covered by 0x2070-0x2BFF:
 *   Superscripts and Subscripts (2070-209F)
 *   Currency Symbols (20A0-20CF)
 *   Combining Diacritical Marks for Symbols (20D0-20FF)
 *   Letterlike Symbols (2100-214F), e.g. ℃, ™
 *   Number Forms (2150-218F), e.g. ⅓
 *   Arrows (2190-21FF), e.g. →
 *   Mathematical Operators (2200-22FF)
 *   Miscellaneous Technical (2300-23FF)
 *   Control Pictures (2400-243F)
 *   Optical Character Recognition (2440-245F)
 *   Enclosed Alphanumerics (2460-24FF)
 *   Box Drawing (2500-257F)
 *   Block Elements (2580-259F)
 *   Geometric Shapes (25A0-25FF)
 *   Miscellaneous Symbols (2600-26FF)
 *   Dingbats (2700-27BF)
 *   Miscellaneous Mathematical Symbols-A (27C0-27EF)
 *   Supplemental Arrows-A (27F0-27FF)
 *   Braille Patterns (2800-28FF)
 *   Supplemental Arrows-B (2900-297F)
 *   Miscellaneous Mathematical Symbols-B (2980-29FF)
 *   Supplemental Mathematical Operators (2A00-2AFF)
 *   Miscellaneous Symbols and Arrows (2B00-2BFF)
 *
 * Covered by 0x3000-0x30FF:
 *   CJK Symbols and Punctuation (3000-303F)
 *   Note: this range also takes in Hiragana (3040-309F) and Katakana (30A0-30FF).
 *
 * Covered by 0x3200-0x32FF:
 *   Enclosed CJK Letters and Months (3200-32FF)
 *
 * Blocks covered by 0x1F000-0x1FA6F:
 *   Mahjong Tiles (1F000-1F02F)
 *   Domino Tiles (1F030-1F09F)
 *   Playing Cards (1F0A0-1F0FF)
 *   Enclosed Alphanumeric Supplement (1F100-1F1FF)
 *   Enclosed Ideographic Supplement (1F200-1F2FF)
 *   Miscellaneous Symbols and Pictographs (1F300-1F5FF)
 *   Emoticons (1F600-1F64F)
 *   Ornamental Dingbats (1F650-1F67F)
 *   Transport and Map Symbols (1F680-1F6FF)
 *   Alchemical Symbols (1F700-1F77F)
 *   Geometric Shapes Extended (1F780-1F7FF)
 *   Supplemental Arrows-C (1F800-1F8FF)
 *   Supplemental Symbols and Pictographs (1F900-1F9FF)
 *   Chess Symbols (1FA00-1FA6F)
 *
 * https://en.wikibooks.org/wiki/Unicode/Character_reference
 *
 * @param codePoint the Unicode code point to test
 * @return true if the code point lies in one of the ranges above
 */

private static boolean isEmojiCharacterByWiki(int codePoint) {
    return ((codePoint >= 0x2070) && (codePoint <= 0x2BFF)) ||
            ((codePoint >= 0x3000) && (codePoint <= 0x30FF)) ||
            ((codePoint >= 0x3200) && (codePoint <= 0x32FF)) ||
            ((codePoint >= 0x1F000) && (codePoint <= 0x1FA6F));
}
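For a quick check of how these ranges behave, here is a self-contained usage sketch. It copies the four ranges from the method above into a hypothetical class `EmojiCheckSketch` and iterates with `String.codePoints()`, a Java 8 API, so on Android it assumes a toolchain and platform level that support Java 8 streams.

```java
final class EmojiCheckSketch {
    /** Same four ranges as isEmojiCharacterByWiki in the article. */
    static boolean inArticleRanges(int cp) {
        return (cp >= 0x2070 && cp <= 0x2BFF)
                || (cp >= 0x3000 && cp <= 0x30FF)
                || (cp >= 0x3200 && cp <= 0x32FF)
                || (cp >= 0x1F000 && cp <= 0x1FA6F);
    }

    /** True if any code point of the string falls in the ranges above. */
    static boolean containsEmoji(String s) {
        // codePoints() yields whole code points, so surrogate pairs are handled for us.
        return s.codePoints().anyMatch(EmojiCheckSketch::inArticleRanges);
    }

    public static void main(String[] args) {
        String grin = new String(Character.toChars(0x1F606));
        System.out.println(containsEmoji("hello"));         // plain ASCII
        System.out.println(containsEmoji("hi " + grin));    // contains U+1F606
        System.out.println(containsEmoji("\u6da6"));        // a CJK ideograph
        System.out.println(containsEmoji("\u3042"));        // Hiragana letter A
    }
}
```

The last call returns true even though U+3042 is an ordinary Hiragana letter, because the 0x3000-0x30FF range extends past CJK Symbols and Punctuation. Whether that is acceptable depends on the requirement; narrowing the range to 0x3000-0x303F would exclude Japanese kana.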


References:

1. When the specification for the Java language was created, the Unicode standard was accepted and the char primitive was defined as a 16-bit data type, with characters in the hexadecimal range from 0x0000 to 0xFFFF.

https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html

2. The native character encoding of the Java programming language is UTF-16.

https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html

3. Wikibooks' Unicode character reference

Zh.wikibooks.org/wiki/Unicod…

4. Emoji coding classification

https://apps.timwhitlock.info/emoji/tables/unicode
