Introduction to Emoji

Emoji (Japanese: 絵文字/えもじ, emoji) are visual emotion symbols used in wireless communication in Japan. "E" means picture and "moji" means character; emoji can represent a variety of emotions, such as smiling faces for laughter and cakes for food. In mainland China, emoji are commonly known as "little yellow faces", or simply emoji. In NTT DoCoMo's i-mode phone system, emoji measure 12×12 pixels, and each image is two bytes long when transmitted. The Unicode encoding range is E63E to E757, while the Shift-JIS encoding range is F89F to F9FC. The basic set contains 176 symbols, and another 76 were added in C-HTML 4.0. Emoji were originally created by Shigetaka Kurita and became popular among Internet and mobile phone users in Japan. Since Apple added an emoji keyboard in iOS 5, emoji have swept the world. Emoji have now been adopted into Unicode, are compatible with most modern computer systems, and are widely used in text messages and social networks.

The above quote comes from Baidu Baike, which states that "each image is two bytes and the Unicode encoding ranges from E63E to E757". However, human creativity is infinite, and such a limited range cannot satisfy people's desire to express themselves. So emoji are no longer limited to two bytes, and more and more rules keep being created around them.

But limited rules always bring two problems, compatibility and extensibility: how to filter out unsupported emoji, and how to extend the set with new ones.

The core issue is how Emoji coding rules work.

Emoji coding

Viewing Unicode and UTF-8 encodings on macOS

Press Control + Command + Space to display the emoji keyboard, then click the icon in the upper right corner.

Click on the Settings – Custom list in the upper left corner.

Select Unicode.

Now we can select Emoji to see Unicode and UTF-8 code.

As you can see, Unicode is written as U+1F436, and UTF-8 takes up four bytes.

If you click through a few more emoji, things are not so simple.

The Chinese flag takes up two Unicode code points, and its UTF-8 encoding takes up eight bytes.

The UTF-8 encoding of the same-sex couple emoji takes up even more… not to mention that it consists of more than one Unicode code point (a concept explained later).

Even more interestingly, the number of bytes changes after a skin-tone modifier is applied.
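We can reproduce this byte difference in code. Below is a minimal Objective-C sketch (the thumbs-up emoji is just an example) that prints how many UTF-8 bytes a string occupies:

    // A minimal sketch: compare the UTF-8 byte count of an emoji
    // before and after a skin-tone modifier (U+1F3FD) is appended.
    NSString *thumbsUp = @"\U0001F44D";                  // 👍  (one code point)
    NSString *tannedThumbsUp = @"\U0001F44D\U0001F3FD";  // 👍🏽 (base + skin-tone modifier)

    NSLog(@"%lu bytes", (unsigned long)[thumbsUp lengthOfBytesUsingEncoding:NSUTF8StringEncoding]);       // 4 bytes
    NSLog(@"%lu bytes", (unsigned long)[tannedThumbsUp lengthOfBytesUsingEncoding:NSUTF8StringEncoding]); // 8 bytes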

So what are Unicode and UTF-8? To understand this problem, go back to ASCII.

ASCII

ASCII (American Standard Code for Information Interchange) is a computer coding system based on the Latin alphabet, used to display modern English and other Western European languages. It is the most common standard for information interchange and is equivalent to the international standard ISO/IEC 646. ASCII was first published as a standard in 1967 and was last updated in 1986; 128 characters have been defined so far.

One ASCII character occupies 1 byte of storage, so in theory it could represent 2^8 = 256 characters.

Standard ASCII, also known as basic ASCII, uses only the lower seven bits, i.e., 128 characters, leaving the highest bit (B7) for parity checking.

Although 128 characters are enough for everyday English, they are not enough for other languages; the French accented letter é, for example, cannot be represented. So some European countries also use the highest bit to encode additional symbols.

Generally speaking, the ASCII symbols 0 to 127 are the same everywhere, while 128 to 255 may differ. For example, 130 stands for é in French encodings, for gimel (ג) in Hebrew encodings, and for yet another symbol in Russian encodings.

As for Chinese, with its vast number of characters, one byte is nowhere near enough, so the GBK encoding uses two bytes per character.

In the same way, human creativity is endless: emoji and other new symbols keep appearing, so more and more encodings have been created that require more and more bytes.

However, it is recommended that characters 0 to 127 stay compatible with standard ASCII.

Unicode

Unicode is an industry standard in computer science, including character set, encoding scheme, etc. Unicode was created to overcome the limitations of traditional character encoding schemes. It provides a unified and unique binary encoding for each character in each language to meet the requirements of cross-language and cross-platform text conversion and processing. It was developed in 1990 and officially announced in 1994. — Baidu Encyclopedia

Unicode code: Unicode code is an international standard encoding that uses two bytes and is incompatible with ASCII code. — Baidu Encyclopedia

As you can see, Unicode includes a character set, encoding schemes, and so on, and according to that description it uses a two-byte encoding.

Some concepts of Unicode

Character set, code point

The Character set (Unicode) is a code table that specifies the one-to-one correspondence between characters and numbers.

When designing a character set, the first step is to determine how many characters are needed and draw up the list of characters. Depending on that number, an upper limit on the integer values can be set; this range is called the code space. In the Unicode standard, the integers in the code space range from 0 to 10FFFF (a specific integer in the code space is called a code point), giving a total of 1,114,112 available code points.

Then, each character in the character list is assigned an integer value, which is a code point. This results in a Character Set, called a Coded Character Set.

When writing code points for Unicode characters, they are usually preceded by a prefix U+, and the numeric part is represented by a hexadecimal value of four to six digits. For example, the code point of character “A” in Unicode is U+0041.
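As a quick illustration, here is a minimal Objective-C sketch; it simply reads the value of "A", whose single UTF-16 code unit equals its code point:

    // A minimal sketch: the letter 'A' is assigned the code point U+0041.
    unichar a = [@"A" characterAtIndex:0];
    NSLog(@"U+%04X", a);   // prints U+0041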

Planes

The Unicode code space ranges from 0 to 10FFFF and is divided into 17 planes, each containing 2^16, i.e., 64K, code points.

  • Plane 0 (U+0000 – U+FFFF) is known as the Basic Multilingual Plane (BMP), also called the zeroth plane. It contains the most frequently used characters.
  • Plane 1 (U+10000 – U+1FFFF) is called the Supplementary Multilingual Plane (SMP), also called the first plane. It contains some less commonly used scripts, such as Deseret.
  • Plane 2 (U+20000 – U+2FFFF) is called the Supplementary Ideographic Plane (SIP), also called the second plane. It contains ideographic characters (such as Chinese characters), most of which are rarely used.
  • Plane 14 (U+E0000 – U+EFFFF) is called the Supplementary Special-purpose Plane (SSP).
  • Planes 15 and 16 (U+F0000 – U+10FFFF) are the Private Use Planes. Together with U+E000 – U+F8FF in the BMP, they form Unicode's Private Use Areas (PUA). These areas are reserved for users: Unicode does not assign characters to these code points, and applications can define their own characters there.
  • None of the other planes are in use yet.

Unicode conversion format: UTFs

UTF stands for Unicode Transformation Format and can be translated into Unicode Character Set Transformation Format, which is how to convert Unicode-defined numbers into program data.

You have probably seen the encodings UTF-8, UTF-16, and UTF-32. For UTF-8 and UTF-16, the number of bytes a character occupies is not fixed. Let's take UTF-8 as an example.

UTF-8 typically encodes each character in one to four bytes, although the original design allowed up to six bytes.

The 128 ASCII characters (Unicode range U+0000 to U+007F) need one byte. Latin letters with diacritics, together with Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana (Maldivian) characters (Unicode range U+0080 to U+07FF), need two bytes. Other characters in the Basic Multilingual Plane (BMP), which is where CJK characters fall, use three bytes, and characters in the supplementary planes use four bytes.

Utf-8 encoding rules are simple, with only two:

  1. For single-byte symbols, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. So for English letters, UTF-8 encoding is identical to ASCII.

  2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each following byte are set to 10. The remaining, unmentioned bits are filled with the bits of the symbol's Unicode code point.

The following table:

Bytes   UTF-8 byte pattern                                       Unicode range (hex)
1       0xxxxxxx                                                 0000 0000 – 0000 007F
2       110xxxxx 10xxxxxx                                        0000 0080 – 0000 07FF
3       1110xxxx 10xxxxxx 10xxxxxx                               0000 0800 – 0000 FFFF
4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                      0001 0000 – 0010 FFFF
5       111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx             0020 0000 – 03FF FFFF
6       1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx    0400 0000 – 7FFF FFFF

A couple of notes:

  • A single UTF-8 byte covers the 128 ASCII characters, so the Unicode range in that row is 00 – 7F, i.e., 0 – 127. The other rows follow the same logic.

  • The number of bits actually available in UTF-8 to encode a character is at most 31: the total number of x's in the 6-byte form.

Unicode and UTF-8 conversions

Take the Chinese character 严 (yán) as an example. Its Unicode code point is 4E25 (100111000100101 in binary).

As the table above shows, 4E25 falls in the range of the third row (0000 0800 – 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, i.e., 1110xxxx 10xxxxxx 10xxxxxx.

Then, starting from the last binary bit of 严, fill the x positions from back to front, and pad any remaining positions with 0. The result is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
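We can verify this result with a minimal Objective-C sketch that dumps the UTF-8 bytes of 严:

    // A minimal sketch: 严 (U+4E25) should encode to E4 B8 A5 in UTF-8.
    NSString *yan = @"\u4E25";  // 严
    NSData *utf8 = [yan dataUsingEncoding:NSUTF8StringEncoding];
    NSLog(@"%@", utf8);         // prints <e4b8a5>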


To sum up.

Unicode is an encoding table that contains all characters.

UTF-8 is a "re-encoding" of Unicode for transmission: it encodes Unicode code points so they can be sent over the network.

A Unicode code point may be converted into a UTF-8 sequence of one byte (ASCII), two bytes (Latin, etc.), three bytes (Chinese, etc.), or four bytes (supplementary planes).

If the text is mostly ASCII characters, UTF-8 saves space (2 bytes of Unicode -> 1 byte of UTF-8 for ASCII).


Unicode dynamic composition and preset characters

Remember at the beginning, when we saw that some emoji do not consist of a single Unicode code point?

“Characters” are far more complex than code points, and a single character may consist of multiple code points.

Dynamic composition

Unicode includes a system for dynamically composing characters by combining multiple code points. This adds flexibility in various ways without causing a huge combinatorial explosion of code points.

If Unicode tried to assign a different code point to every possible combination of letter and diacritics, things would quickly get out of hand. Instead, the dynamic composition system lets you specify diacritics by starting with a base character and appending additional code points called "combining characters" to construct the desired character. When the text renderer sees such a sequence in a string, it automatically stacks the diacritics above or below the base letter to create a composed character. For example, the accented character "Á" can be expressed as a string of two code points: U+0041 LATIN CAPITAL LETTER A followed by U+0301 COMBINING ACUTE ACCENT. The string is automatically rendered as a single character: "Á".
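Here is a minimal Objective-C sketch of this composition, using the same A plus combining acute accent example:

    // A minimal sketch: 'A' (U+0041) followed by COMBINING ACUTE ACCENT (U+0301)
    // is two code points, but renders as the single character "Á".
    NSString *composed = @"A\u0301";
    NSLog(@"%@ has %lu UTF-16 code units", composed, (unsigned long)composed.length); // Á has 2 UTF-16 code units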

The combining mark system does allow any number of diacritics to be stacked on any base character.

Zalgo text exploits this by randomly piling any number of diacritics onto each letter, causing the text to overflow the line spacing. (As shown below)

There are several concepts involved.

  • Base character: in writing, a character that does not combine with a preceding character, and that is neither a control character nor a format character.
  • Combining character: in writing, a character that combines with a preceding base character. The combining character is applied to the base character.

Although combining characters are meant to be displayed together with base characters, two situations are possible: (1) there is no base character before the combining character; (2) the combining operation cannot be performed during processing. In both cases the combining character may be displayed on its own, without being merged.

In code charts, combining characters are drawn with a dotted circle. When displayed in combination with a preceding base character, the base character appears in the position of the dotted circle.

  • Combining character sequence: a sequence consisting of a base character followed by one or more combining characters, or a sequence of one or more combining characters.

Preset (precomposed) characters

Today, Unicode also contains a number of "preset" (precomposed) code points, each representing a commonly used combination, such as U+00C1 "Á" LATIN CAPITAL LETTER A WITH ACUTE, or U+1EC7 "ệ" LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW. In fact, most of the accented letters common in European languages have preset forms, so dynamic composition is rarely needed in ordinary text.

Presumably these preset characters were added in certain versions of the Unicode character set (though I have not found evidence to confirm this).

Equivalence of dynamic composition and preset characters

In Unicode, preset characters and the dynamic composition system coexist. The consequence is that there are multiple ways to represent the same string: different sequences of code points produce the same user-perceived characters. For example, to represent the character "Á" we saw earlier, we could use the single code point U+00C1, or the two code points U+0041 and U+0301. To deal with such equivalent strings, Unicode defines several formal normalization forms, such as NFD and NFC; since this part is more complicated (I have not fully worked through it yet), it will not be described here.
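Foundation does expose canonical normalization on NSString (precomposedStringWithCanonicalMapping returns the NFC form, decomposedStringWithCanonicalMapping the NFD form). A minimal sketch comparing the two representations of Á:

    // A minimal sketch: the preset Á (U+00C1) and the decomposed A + U+0301
    // compare as unequal literally, but match after both are normalized to NFC.
    NSString *preset = @"\u00C1";        // Á, one code point
    NSString *decomposed = @"A\u0301";   // A + combining acute accent

    NSLog(@"%d", [preset isEqualToString:decomposed]);   // 0: literal comparison fails
    NSLog(@"%d", [[preset precomposedStringWithCanonicalMapping]
                  isEqualToString:[decomposed precomposedStringWithCanonicalMapping]]); // 1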


Grapheme clusters

As you can see, there are many cases in Unicode where what a user thinks of as one "character" actually consists of multiple code points underneath. Unicode uses the concept of a "grapheme cluster" for this: a sequence of one or more code points that constitutes one "user-perceived character".

UAX #29 defines the precise rules for grapheme clusters. Roughly, a grapheme cluster is "a base code point followed by any number of combining marks", but the actual definition is a bit more complicated; it also covers Korean Hangul jamo and emoji ZWJ sequences.

Grapheme clusters are primarily used in text editing: they are the most natural units for cursor movement and text selection. Using grapheme clusters ensures that symbols are not unexpectedly broken when copying and pasting text, that the left and right arrow keys always move by exactly one visible character, and so on.

Another place grapheme clusters matter is when enforcing string length limits, for example in a database field. The underlying limit is often something like a byte length in UTF-8, but you cannot simply truncate at an arbitrary byte. At a minimum you should truncate at the nearest code point boundary; better yet, truncate at the nearest grapheme cluster boundary. Otherwise you might break a character by dropping one of its diacritics, or break a jamo sequence or a ZWJ sequence.

Emoji use ZWJ sequences.

How emoji splicing is implemented

Now we can try to understand the implementation of Emoji splicing.

It’s essentially a set of coding rules that are set up to be spliced when matching.

  • Flags are emoji composed of two Unicode code points.

They use the Unicode regional indicator symbols.

A specific range of code points is reserved for describing regions. When the text renderer supports this matching rule, code points falling in this range are automatically read and merged.

  • Other emoji join multiple code points with a connecting character.

Multiple code points are joined with the zero-width joiner, ZWJ (U+200D), but the whole sequence is displayed as a single emoji.

Look closely at that emoji: it contains several U+200D code points.

The shortest such sequences contain 3 code points; the longest contain up to 7.
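A minimal Objective-C sketch of such a ZWJ sequence, building the "couple with heart" emoji 👨‍❤️‍👨 from its pieces:

    // A minimal sketch: MAN (U+1F468), ZWJ (U+200D), HEAVY BLACK HEART (U+2764)
    // plus VARIATION SELECTOR-16 (U+FE0F), ZWJ (U+200D), and MAN (U+1F468)
    // join into a single displayed emoji.
    NSString *couple = @"\U0001F468\u200D\u2764\uFE0F\u200D\U0001F468";
    NSLog(@"%@ has %lu UTF-16 code units", couple, (unsigned long)couple.length); // 👨‍❤️‍👨 has 8 UTF-16 code units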

On systems that do not support a given sequence, it is displayed as several separate emoji. Below is how the same emoji renders in a Markdown editor versus the rich text editor of another application.

Emoji in iOS strings

So far we have covered how emoji are encoded, from Unicode down to UTF-8. What pitfalls do emoji cause in everyday iOS development?

Length and range

  • The concept of length

Let’s start with some code.

    NSString *string = @"👨‍❤️‍💋‍👨";
    NSLog(@"%lu", string.length);

The output above is 11.

If you look at the documentation, Apple calculates string length in UTF-16 code units.

  • The concept of the range

One more piece of code: we pass in index 1, expecting to get "a".

    NSString *string = @"😀a";
    NSLog(@"%hu", [string characterAtIndex:1]);

Instead of getting "a", we get half of the emoji's surrogate pair; the index of "a" is actually 2, i.e., the third UTF-16 code unit.

So we need the concept of a range: the actual range of code units that make up one displayed character under the rules supported by the current system version.

For special characters, get the true range first, so that you never split a single character.

NSRange is a commonly used structure in Foundation framework. Its definition is as follows:

typedef struct _NSRange {
    NSUInteger location; // the starting position of the range
    NSUInteger length;   // the length of the range
} NSRange;
  • Index and range conversion

Apple provides APIs to convert between them: pass in an index or a range and get back the full range of the composed character sequence.

- (NSRange)rangeOfComposedCharacterSequencesForRange:(NSRange)range;
- (NSRange)rangeOfComposedCharacterSequenceAtIndex:(NSUInteger)index;
- (NSRange)rangeOfString:(NSString *)searchString;
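For example, a minimal sketch using the 😀a string from above:

    NSString *string = @"😀a";
    // The composed character sequence containing UTF-16 index 0 spans the whole surrogate pair.
    NSRange emojiRange = [string rangeOfComposedCharacterSequenceAtIndex:0];
    NSLog(@"%@", NSStringFromRange(emojiRange));          // {0, 2}
    NSLog(@"%@", [string substringWithRange:emojiRange]); // 😀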

So when dealing with strings, to get the character you actually see, you should use a range:

    NSString *string = @"😀a";
    NSRange characterRange = NSMakeRange(2, 1);
    NSLog(@"%@", [string substringWithRange:characterRange]);
  • Get the character displayed at position X

Apple doesn’t offer this feature directly.

But we can build an array that holds each displayed character.

    NSMutableArray *displayCharArray = [NSMutableArray array];
    NSString *string = @"😀👩🏽👨‍👨‍👧‍👦👩‍👩‍👧‍👦👩‍👩‍👧👩‍👩‍👦‍👦👪👨‍❤️‍💋‍👨👩‍❤️‍👩👨‍❤️‍👨👬";
    [string enumerateSubstringsInRange:NSMakeRange(0, string.length) options:NSStringEnumerationByComposedCharacterSequences usingBlock:^(NSString * _Nullable substring, NSRange substringRange, NSRange enclosingRange, BOOL * _Nonnull stop) {
        [displayCharArray addObject:substring];
        NSLog(@"%@", substring);
    }];

    NSLog(@"The sixth displayed character is %@", displayCharArray[5]);


References

  • Emoji combat problem: iOS, Android, Server
  • Unicode character encoding standard
  • Combining character
  • A Programmer’s Introduction to Unicode