While digging into character encoding, after consulting all kinds of references and Wikipedia entries, I started writing up my own notes, only to find halfway through that another author (blog: Clumsy Lin) had already written a better summary than my draft. So I have reproduced that text here (combined with material from other articles), with some details supplemented and polished, so that the ideas are laid out clearly and can be read closely and consulted at any time. Many of the links point to Wikipedia, which may require a proxy (VPN) to access in some regions.

The goal is to work through the whole system of "character encoding" in one pass (split into five parts because of length limits on a single post) and never be troubled by it again.

Full text:

  • Character encoding (I): terms and the origin of character encoding
  • Character encoding (II): simplified Chinese character encodings and ANSI encoding
  • Character encoding (III): the Unicode encoding system and byte order
  • Character encoding (IV): the UTF series of encodings in detail
  • Character encoding (V): network transfer encodings Base64 and percent-encoding

1.0 introduction

Character encoding is one of the most fundamental and important topics in the computer world. Yet it is usually skimmed over in computer textbooks, and there is hardly a book devoted to it in depth. In programming practice, if you do not dig all the way down and figure out where character encoding problems come from and how they evolved, they will haunt you like a ghost, and all sorts of "supernatural" encoding incidents will torment you from time to time.

Character encoding is fundamental and important mainly because it touches such a wide range of fields: downward it reaches the computer's underlying technology and even the hardware implementation; upward it is closely tied to almost every operating system, programming language, and application.

Other topics that are as basic, important, widely used, and just as confusing as character encoding include byte order (that is, big-endian versus little-endian representation), regular expressions, floating-point implementations, date and time handling, and so on. Byte order and regular expressions are both closely related to character encoding: byte order directly affects the byte sequence produced by a character encoding, and since regular expressions are mainly used to find and extract characters or substrings from strings, a deep understanding of character encoding is needed to truly understand regular expressions.

1.1 Key terms

To really understand character encoding problems, we must start from the basic concepts of the computer: bit, byte, character set, character encoding, and so on, and then analyze concrete cases in the context of different system environments and programming environments.

The material on number bases was added by me and is not part of the original series.

Bit (binary digit)

A bit, also called a binary digit or binary position, is one digit of a binary number and is the smallest unit of information representation in a computer.

The word bit is a portmanteau of binary digit, coined by the mathematician John Wilder Tukey (probably in 1946, though some sources claim as early as 1943). The first formal use of the term was on page 1 of Shannon's famous paper A Mathematical Theory of Communication. Bits are conventionally denoted by a lowercase letter b; for example, 8 bits can be written as 8 b.

Each bit has two possible values, 0 and 1, which, besides representing the numeric values themselves, can also represent:

  • Positive and negative signs;
  • Two states, such as a lamp being on or off, a wire carrying current or not, and so on;
  • Yes and no, or true and false, in abstract logic.

Byte, 8-bit group

In computers, bits are usually used in groups, called bit strings. Obviously, computer systems do not let you use bit strings of arbitrary length; they work with bit strings of a few particular lengths.

Some common bit-string lengths have conventional names. For example, a nibble is a group of four bits, and a byte is a group of eight bits. There are also word, double word, quad word, ten byte (tbyte), and so on.

A byte (called 位元组 in Hong Kong, Macao and Taiwan) is a basic unit used to measure storage capacity and transmission capacity in computers. It is a bit string made up of a contiguous, fixed number of bits, usually eight; that is, 1 byte equals 8 bits. Bytes are conventionally denoted with a capital B, so this can also be written as 1 B = 8 b.

The memory of a modern personal computer (PC) is generally addressed by the byte, which is called byte addressing, so the byte is generally the smallest unit of memory access and the smallest addressable unit of the processor. (There are also bit addressing, word addressing, and so on, but they are not widely used on personal computers and are not discussed here.)

The byte's role as the smallest unit of memory access and the smallest addressable unit of the processor is closely related to character encoding: for example, single-byte versus multi-byte code units, and big-endian versus little-endian byte order, are all tied to byte-based basic data types (see the later sections).

It is customary to number the bits within a byte from right to left, from least significant (bit 0) to most significant (bit 7):

Note that a byte does not have to be 8 bits. There have been standards with 4, 6, 7, 12, or even 18 bits per byte, for example the IBM 701 (36-bit word, 18-bit byte), the IBM 702 (7-bit word, 7-bit byte), and the CDC 6600 (60-bit word, 12-bit byte). It is simply the de facto standard of modern computers to use eight bits per byte. (Besides historical and commercial reasons, the most important one is that eight is a power of two, which makes computation easier.)

It is for this reason that many of the more rigorous technical specifications use the term octet rather than byte, to emphasize an 8-bit string and avoid ambiguity.

However, since a byte is generally understood to be a group of 8 bits, in what follows "8-bit group" and "byte" are used interchangeably unless otherwise specified.

Ps: Terms such as high-order, highest bit, and high byte, as well as low-order, lowest bit, and low byte, appear frequently below. "High" and "low" are relative concepts: when humans read and write binary numbers, counting from left to right, the left side is the high end and the right side is the low end. See "1.7 Byte order" for a detailed explanation.

Word and word length

Although the byte is the smallest unit of storage and transmission for most modern computers, it is not the most efficient unit of data that a computer can process.

In general, the data size that a computer can most efficiently process should be the same size as the word length, which brings us to the concept of word and word length.

  • Word: In a computer, a string of bits (bit string, bit string) that is processed or operated on as a unit is called a computer word, or word for short. Words are usually divided into several bytes.
  • Word length: the length of a word, which refers to the number of digits contained in each word of a computer. The word length determines the actual number of bits processed by the CPU in a single operation. The word length is determined by the data bus width of the CPU’s external data path.

The speed at which a computer processes data is obviously related to how many bits it can process at a time and how quickly it can perform operations. If one computer is twice the word length of the other and the two computers are running at the same speed, the former can usually do twice as much work as the latter in the same amount of time. Therefore, word length has a great relationship with the function and use of the computer, and is an important technical index of the computer.

At the time of the original article, desktop processors were transitioning from 32-bit to 64-bit word lengths, embedded devices had largely settled on 32 bits, and in some specialized fields (such as high-end graphics cards) processor word lengths had already reached 64 or even 128 bits.

Base notation

Character encoding is essentially a mapping between characters and binary numbers (except for ASCII, the mapping is usually not direct and requires an extra conversion step). But binary is too unintuitive for humans to remember, so other base representations that are easier to understand and memorize are used as well. For readers without a computer science background, it is important to understand the common bases (binary, octal, hexadecimal) and how they are written in different environments, systems, and languages.

A base, or positional (carry) numeral system, is a man-made counting method that carries over to the next position (there are also counting methods without carrying, such as knot counting, the strokes of the Chinese character 正 often used to tally votes, and the similar Western tally mark). For any base X, a digit in any position carries over to the next position whenever it reaches X: decimal carries at ten, hexadecimal at sixteen, binary at two, and base X at X.

binary

In mathematics and digital circuits, binary is the positional numeral system with base 2. In this system two symbols are used, 0 (zero) and 1 (one). Logic gates in digital electronic circuits implement binary directly, so modern computers and computer-dependent devices use binary, and each binary digit is called a bit.

Binary is already the most basic language for computers, with zeros and ones everywhere.

Why computers use binary:

First, the binary counting system uses only two digits, 0 and 1, so any component with two distinct stable states can represent one digit of a number, and many components actually have two such states: a neon lamp is "on" or "off"; a switch is "on" or "off"; a voltage is "high" or "low", "positive" or "negative"; a paper tape has a "hole" or "no hole"; a circuit carries a "signal" or "no signal"; a magnetic material is magnetized north or south; and so on. It is easy to use such clearly distinct states to represent numbers. What is more, the two states differ not just in degree but in kind, which greatly improves the machine's noise immunity and reliability. Finding a simple, reliable device that can represent more than two states is much harder.

Second, the rules of binary arithmetic are very simple, and the four basic operations all reduce to addition and shifting, so the arithmetic circuits of an electronic computer become very simple. Moreover, with simpler circuits the speed can be increased. The decimal system cannot match this.

Third, representing numbers in binary saves hardware in an electronic computer. It can be shown theoretically that the most hardware-efficient base would be ternary, followed by binary. Yet most electronic computers still use binary, because binary has advantages that other bases, including ternary, do not. In addition, since binary has only the two symbols 0 and 1, Boolean algebra can be used to analyze and synthesize the machine's logic circuits, which provides a very useful tool for designing computer circuitry.

octal

Octal, abbreviated OCT or O, is the counting system based on the eight digits 0, 1, 2, 3, 4, 5, 6, 7, carrying over at eight. Some programming languages use a leading digit 0 to mark a number as octal. Octal digits correspond directly to binary digits (one octal digit corresponds to three binary digits), so octal is sometimes used in computer languages.

Octal notation was once common in computer systems, and we still occasionally see it used. However, since one hexadecimal digit corresponds to exactly four binary digits, hexadecimal is more convenient for writing binary, and so octal is not as widely used as hexadecimal. Several programming languages can still write numbers in octal notation, and some older Unix applications use octal.

In programming languages, octal literals are marked with various prefixes, including the digit 0, the letter o or q, the combination 0o, or symbols such as & or $.

For example, the octal number 73 can be written as 073, o73, q73, 0o73, \73, @73, &73, $73 or 73o in different languages.

Newer languages (ECMAScript, etc.) have abandoned the bare prefix 0, because 0 and the letter o are too easily confused; the prefix 0o is generally used instead.

Prefix    Environment
0o        ECMAScript 6, Python 3.0, Ruby, etc.

hexadecimal

The hexadecimal system (abbreviated hex, or written with a subscript 16) is the positional numeral system with base 16. It is usually written with the digits 0 to 9 and the letters A to F, where A through F stand for the decimal values 10 through 15; together these are the hexadecimal digits.

For example, the decimal number 57 is written 111001 in binary and 39 in hexadecimal.

Today hexadecimal is widely used in computing because converting each group of four bits into a single hexadecimal digit is straightforward, so one byte can be written as exactly two hexadecimal digits. However, mixing hexadecimal with other notations is confusing, so some prefix, suffix, or subscript is needed to mark it.

Different computer systems and programming languages write hexadecimal numbers differently. The most common ways are to put 0x before the number or to append a subscript 16 after it.

Environment              Format     Note
URL                      %hex
XML and XHTML            &#xhex;
HTML, CSS                #hex       used to indicate colors
Unicode                  U+hex      denotes a character code point
UTF-8                    \xhex      byte escape in many languages
UTF-16                   \uhex      16-bit universal character name
UTF-32                   \Uhex      32-bit universal character name
MIME                     =hex
Modula-2                 #hex
Smalltalk, ALGOL 68      16rhex
Common Lisp              #xhex or #16rhex
IPv6                     eight groups of hex separated by colons
Intel assembly language  hexh
Other assemblers         $hex
Visual Basic, etc.       &Hhex

C, C++, Shell, Python, Java, and similar languages use the prefix "0x", as in "0x5A3". The leading "0" makes it easier for the parser to recognize a number, and the "x" marks it as hexadecimal (much as "o" marks octal in the 0o prefix). The "x" in "0x" may be uppercase or lowercase.

Conversion between bases is straightforward; a small sketch follows.
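
As a quick illustration (a minimal Python sketch added by me, not part of the original text), the same value can be written in, and converted among, the common bases:

# Base conversion sketch in Python (illustrative only)
n = 57                                   # decimal 57
print(bin(n))                            # 0b111001  -> binary
print(oct(n))                            # 0o71      -> octal
print(hex(n))                            # 0x39      -> hexadecimal
print(0b111001, 0o71, 0x39)              # 57 57 57: literals in other bases denote the same value
print(int("111001", 2), int("39", 16))   # 57 57: parsing strings written in a given base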

Encoding and decoding

Encoding (encode) is the process of converting information from one form into another, for example converting characters (letters, digits, symbols, etc.), images, sounds, or other objects into specified electrical pulses or binary digits by a pre-agreed method.

Decoding (decode) is the reverse process of encoding.

Character set

A character set (charset), as the name suggests, is a collection of characters: the set of all characters supported by the writing system of a natural language. Character is the general name for written words, digits, letters, syllables, punctuation marks, graphic symbols, and other written signs.

The ASCII character set, for example, defines 128 characters, and the GB2312 character set defines 7445 characters. More precisely, a character set is an ordered set of numbered characters (though the numbering is not necessarily contiguous, as in EASCII; more on that later).

Common character sets include the ASCII character set, the ISO 8859 series (ISO 8859-1 through 8859-16), the GB series (GB2312, GBK, GB18030), the Big5 character set, the Unicode character set, and so on.

One account (shown in a figure in the original article) is that Microsoft extended GB2312 into GBK (Guo-Biao Kuozhan, "national standard extended"), and GBK later became a national standard. But there is also information online saying that GBK came first (drawn up by the national technical committee for information technology standardization on December 1, 1995), and that Microsoft then extended its internal code page CP936 on the basis of GBK; that is, the CP936 code page in Windows is actually an implementation of the GBK encoding scheme (the original author leans toward the latter account).

Character encoding

Character encoding, also called character set encoding, is the process of encoding the characters of a character set, by some agreed method, into objects of a specified collection (for example bit patterns, sequences of natural numbers, 8-bit groups, or electrical pulses), so that text can be stored in computers and transmitted over communication networks.

In other words, it establishes a mapping between a character set and a specified collection of objects, and it is one of the basic techniques of information processing.

Thus, characters are usually defined by character sets, while computer-based information processing systems use combinations of different states of electronic components (i.e., hardware) to represent, store, and process characters.

The different states of electronic components (hardware), generally combinations of the two states "on" and "off", can represent the digits of a number system (for example, on and off standing for the 1s and 0s of binary). So the process of character encoding can also be understood as the process of mapping characters to binary digits that a computer can accept, which allows characters to be conveniently represented, stored, processed, and transmitted (including over networks) by a computer.

A familiar example is the encoding of the Latin alphabet into Morse code and into ASCII. ASCII assigns numbers to letters, digits, and other symbols, and represents each number directly as a 7-bit binary value in the computer. An extra leading bit "0" is usually added in the highest (that is, first) position so that the computer can process, store, and transfer each character in exactly one byte (8 bits).
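
As a small illustration (a Python sketch of the idea, added by me): the character "A" is numbered 65 in ASCII, which is 100 0001 in 7-bit binary; padding a 0 in front gives the 8-bit byte 0100 0001 that is actually stored.

# Character <-> number mapping in ASCII (Python sketch)
print(ord('A'))                  # 65: the ASCII number of 'A'
print(format(ord('A'), '07b'))   # 1000001: the 7-bit code
print(format(ord('A'), '08b'))   # 01000001: padded to one 8-bit byte
print('A'.encode('ascii'))       # b'A': stored as the single byte 0x41
print(chr(65))                   # 'A': decoding the number back to the character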

Traditional character encoding model

A character encoding model is a model framework that reflects the structural characteristics of a character encoding system and the relationships among its components.

For historical reasons, early character sets and character encodings were generally considered synonyms and did not require a strict distinction. Thus, in traditional character encoding models represented by simple character sets such as ASCII, the two concepts are almost equivalent.

In the traditional character encoding model, the characters of the character set are simply numbered (the number usually fits in one byte when converted to binary), and that character number is directly used as the character encoding (this statement will be explained in more detail later).

Modern character encoding model

The modern character encoding model, represented by Unicode and the Universal Character Set (UCS, ISO/IEC 10646), does not directly take the simple "character set equals encoding" approach of ASCII (that is, the traditional character encoding model); it adopts a new encoding idea.

This new encoding idea breaks down the concept of character set and character encoding into the following aspects:

  • what characters there are;
  • how these characters are numbered;
  • how these numbers are encoded into a series of finite-size logical codes, that is, into code unit sequences;
  • how these logical code unit sequences are converted into (that is, mapped to) physical byte sequences (byte streams);
  • in some special transmission environments (such as Email), how the byte sequences are further adaptively encoded.

Taken as a whole, these aspects form the modern character coding model – Wikipedia.

The core idea behind the modern character coding model’s decomposition into these aspects is to create a universal character set that can be encoded in different ways. Notice the key words here: “different way” and “universal”.

This means that the same character set can be used in different encoding methods; That is, different encoding methods can be used to encode the same character set. The relationship between character set and encoding mode can be one-to-many.

Furthermore, in the traditional character encoding model the character encoding and the character set are tightly bound together, whereas in the modern character encoding model the character encoding is separated from the character set. In software engineering jargon, the previously tightly coupled character encoding and character set have been decoupled.

Therefore, in order to correctly represent this modern character encoding model, more precise conceptual terms than “character set” and “character encoding” are needed to describe it.

In Unicode Technical Report (UTR) #17, Unicode Character Encoding Model, the modern character encoding model is divided into five levels, and more precise conceptual terms are introduced to describe it:

  • Level 1, Abstract Character Repertoire (ACR): defines the range of characters (that is, determines which characters are supported);
  • Level 2, Coded Character Set (CCS): assigns numbers to the characters (that is, the characters in the ACR are numbered); Ps: this is the "character set" of the modern character encoding model
  • Level 3, Character Encoding Form (CEF): encodes character numbers into logical code unit sequences (the logical character encoding); Ps: the UTF family
  • Level 4, Character Encoding Scheme (CES): maps the logical code unit sequences to physical byte sequences (the physical character encoding); Ps: the byte streams actually handled on the local computer
  • Level 5, Transfer Encoding Syntax (TES): further adaptive encoding of the byte sequences; Ps: for network transmission the local byte stream is additionally encoded, for example with Base64 or percent-encoding (that is, URL encoding).

Ps: What follows is a detailed explanation of the five levels of the modern character encoding model. On a first pass I suggest just skimming it, reading the rest of the article, and then coming back to digest it.

1 Abstract character table ACR

The abstract character repertoire ACR is the collection of all abstract characters supported by an encoding system. It can be understood simply as an unordered set of characters, and it determines the range of characters, that is, which characters are supported.

An important feature of the abstract character repertoire ACR is that it is unordered: the characters are not arranged in any numeric order, and therefore carry no numbers.

“Abstract” characters do not have a particular glyph and should not be confused with “concrete” characters that have a particular glyph.

A repertoire can be closed (the range of characters is fixed), meaning no new characters may be added unless a new standard is created; this is the case for the ASCII repertoire and the ISO/IEC 8859 series. A repertoire can also be open (the range of characters is not fixed), meaning new characters can keep being added; the Unicode repertoire and, to some extent, code pages (explained later) are examples of this.

2 Coded character set CCS

The term "Coded Character Set" is usually rendered literally, but the word "coded" in it is easily confused with the "encoding" of the character encoding form and character encoding scheme discussed below, which makes it harder to understand. It is therefore better to think of it as a "numbered character set".

As mentioned above, the characters in the abstract character repertoire are unordered. With an unordered repertoire one can only ask whether a character belongs to it; one cannot conveniently refer to or cite a particular character in it.

In order to reference and refer to characters in the character table easily, each character in the abstract character table must be numbered.

Character numbering means representing each abstract character (character for short) in the abstract character repertoire ACR as a non-negative integer N, or mapping it to a coordinate (a pair of non-negative integers x, y); in other words, the set of abstract characters is mapped to a set of non-negative integers or of pairs of non-negative integers. The result of this mapping is the coded (numbered) character set CCS. A character's number is therefore its non-negative integer code.

For example, in a given abstract character repertoire, the character for the uppercase Latin letter "A" is assigned the non-negative integer 65, the character "B" is assigned 66, and so on.

This leads to the concept of the code space (also called the code point space): depending on the number of abstract characters in the repertoire, an upper limit is set on the character numbers (usually larger than the total number of characters in the repertoire). The range of non-negative integers from 0 to this upper limit is called the code space.

The code space can be described in several ways:

  1. By a pair of non-negative integers, for example: the GB2312 Chinese character code space is 94 x 94;

  2. By a single non-negative integer, for example: the ISO-8859-1 code space is 256;

  3. Equivalently, ISO-8859-1 has an 8-bit code space (2^8 = 256);

  4. In terms of subsets, such as rows, columns, planes, and so on.

A position in the code space is called a code point (or code position). The coordinate (pair of non-negative integers) of the code point occupied by a character, or the single non-negative integer it corresponds to, is the number of that character, also called its code point value (code point number).

Strictly speaking, however, character numbers are not exactly the same thing as code point numbers (code point values), because in practice, for various special reasons, the number of code points in the coded character set CCS is larger than the number of characters in the abstract character repertoire ACR.

In the coded character set there are, besides character code points, also non-character code points and reserved code points, so "character number" is less precise than "code point number". Still, it is more direct to call the code point numbers of character code points character numbers.

In the Unicode encoding scheme, character code points are also called Unicode scalar values (a term no more intuitive than "character code point"). Non-character code points and reserved code points are described in later parts of the series.

Note that "code point value" is often shortened to just "code point": when people say code point, they may mean a position in the code space (the code point proper), or they may actually mean the code point value (the code point number) of that position. The exact meaning depends on the context.

Such loose usage is common; the terms "character set", "character number", and "character encoding" are likewise often used in place of one another. Saying "code point" for "code point value" is not hard to follow in context, but the mutual substitution of "character set", "character number", and "character encoding", however understandable its historical causes, makes the current situation confusing and maddening.

Therefore, the so-called numbered character set can be simply understood as the result of numbering abstract characters one by one or mapping them one by one to code point values (i.e. code point numbering).

Numbered character set CCS is often referred to simply as a character set.

Be careful not to confuse the numbered character set CCS with the abstract character table ACR. Multiple different numbered character sets CCS can represent the same abstract character table ACR. In other words, the same abstract character table ACR can be numbered into different numbered character sets CCS according to different numbering rules.

In the Unicode standard, a single abstract character may correspond to multiple code points (for compatibility with other standards; for example, the code points U+51C9 and U+F979 actually both represent the same character 凉, the latter existing for compatibility with the Korean character set standard KS X 1001:1998; see the Unicode documentation for details). A character may also be represented by a sequence of code points (a base character combined with a combining character; for example, à consists of the base letter "a", code point U+0061, plus the combining grave accent "̀", code point U+0300). At the same time, not every code point corresponds to a character: there are also non-character code points and reserved code points. See the explanation below.
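
A small Python sketch (my addition, using the standard unicodedata module) makes both cases concrete: normalization maps the compatibility code point U+F979 to U+51C9, and composes "a" + U+0300 into the single code point U+00E0 (à):

import unicodedata

# One abstract character, two code points (compatibility ideograph):
compat = '\uF979'
print(hex(ord(unicodedata.normalize('NFC', compat))))   # 0x51c9

# One abstract character, a sequence of code points (base + combining mark):
decomposed = 'a\u0300'                                   # 'a' followed by COMBINING GRAVE ACCENT
composed = unicodedata.normalize('NFC', decomposed)
print(composed, hex(ord(composed)))                      # à 0xe0
print(len(decomposed), len(composed))                    # 2 1  (counting code points)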

Special note: although for some character encodings the "number" corresponds closely to the later "encoding" of the character encoding form CEF (the code unit sequence, explained later) and of the character encoding scheme CES (the byte sequence, explained later), "number" and "encoding" are two completely different concepts.

The process of numbering characters, that is, of fixing their code point values, has no direct connection with the computer; it can be regarded as a purely mathematical problem, because it only establishes a one-to-one correspondence between characters and numbers (code point values, code point numbers). No encoding algorithm is involved yet; that is, nothing has yet been said about how the numbers are encoded into code unit sequences according to a given character encoding form CEF, or how the code unit sequences are further encoded into byte sequences according to a given character encoding scheme CES.

Unfortunately, these two concepts are often used interchangeably, which is a source of confusion and a regrettable reality.

3 Character encoding form CEF

When discussing the abstract character repertoire ACR we said that unlike the traditional, closed ASCII repertoire with its fixed number of characters, the Unicode repertoire is a modern, open repertoire whose size is not fixed; more characters may be added in the future (many Emoji, for example, have been added over time). Therefore the number of code points needed by the Unicode coded character set keeps growing and is, relatively speaking, unlimited.

But the range of integers a computer can represent directly is relatively limited. For example, an unsigned single-byte integer (unsigned char, uint8) can only represent the 256 numbers from 0 to 0xFF; an unsigned two-byte short integer (unsigned short, uint16) can only represent the 65536 numbers from 0 to 0xFFFF; and an unsigned four-byte integer (unsigned long, uint32) can represent the 4294967296 numbers from 0 to 0xFFFFFFFF.

So here’s the question:

  1. On the one hand, how can a relatively limited set of integers be made scalable enough to adapt to a relatively unlimited, growing set of characters in the future? Should characters be represented indirectly with several single-byte integers, or directly with one sufficiently large multi-byte integer?

  2. On the other hand, ASCII, as one of the earliest and most widely used encoding schemes, clearly cannot simply be abandoned; full incompatibility with it would be unwise. Should compatibility be kept directly or indirectly?

These two problems call for a comprehensive solution, and that solution is the Character Encoding Form (CEF).

The character encoding form CEF converts, or encodes, the code point value (code point number, character number) of each character in the coded character set into a code value of limited bit length (the character code). This code value is in fact a code unit sequence (Code Unit Sequence).

So what is a code unit? Why introduce the concept of code units at all? And how are the character numbers (code point values) of a character set converted into the character codes (code unit sequences) used inside a computer? Don't worry: just note the concepts here; the details are explored later in the series.

CEF is also sometimes called the storage format (Storage Format). Calling it a storage format is actually not quite accurate, because CEF is still only a logical, platform-independent form of encoding, not a physical, platform-dependent way of storing data (that is the job of level 4).

In a traditional, simple character encoding system such as ASCII, there is no need to distinguish character number from character encoding: the character number simply is the character encoding, and the mapping between the two is direct.

In a modern, complex character encoding system such as Unicode, however, character number and character encoding must be distinguished. The character number is not necessarily equal to the character encoding, and the mapping between them is not necessarily direct: UTF-8 and UTF-16 are indirect mappings, while UTF-32 is a direct mapping.

UTF-8, UTF-16, and UTF-32 are the common character encoding forms for the Unicode character set (the coded character set). (They are described in detail later in the series.)
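
A quick Python sketch (my addition) shows the difference: for the character 汉 the UTF-32 code unit is just the code point value itself, while the UTF-8 byte sequence is computed by an indirect transformation:

ch = '汉'
print(hex(ord(ch)))                  # 0x6c49: the code point (character number)
print(ch.encode('utf-32-be').hex())  # 00006c49: direct mapping, the code unit equals the code point
print(ch.encode('utf-16-be').hex())  # 6c49: for this BMP character the 16-bit code unit also equals the code point
print(ch.encode('utf-8').hex())      # e6b189: indirect mapping computed from the code point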

In many articles the relationship between Unicode and the UTFs (Unicode/UCS Transformation Format, including UTF-8, UTF-16, and UTF-32) is summarized as: Unicode is the standard or specification, and the UTFs are its encoding implementations.

Loosely speaking, this view is acceptable. However, it is not rigorous, and it does not help in understanding the deeper relationships among the Unicode standard (that is, the Unicode encoding scheme or Unicode encoding system), the Unicode character set, and the various UTF character encoding forms.

Note: from the viewpoint of the modern character encoding model, "Unicode encoding standard", "Unicode encoding scheme", and "Unicode encoding system" are basically synonyms: a complete standard system that includes all five levels, the abstract character repertoire ACR, the coded character set CCS, the character encoding form CEF, the character encoding scheme CES, and even the transfer encoding syntax TES, rather than any single level.

4 Character encoding scheme CES

The official term for this level is Character Encoding Scheme, but "scheme" is also habitually used for a whole character encoding standard or character encoding system; to keep the two apart, the original text renders this level as "character encoding mode".

The character encoding scheme CES, also called the serialization format (Serialization Format), maps code unit sequences to byte sequences (that is, byte streams), so that encoded characters can be processed, stored, and transmitted in a computer.

If mapping the code point values (character numbers) of a coded character set to code unit sequences is a logical encoding process independent of any particular computer platform (the level-3 character encoding form CEF), then mapping code unit sequences to byte sequences is a physical encoding process tied to a particular computer platform (the level-4 character encoding scheme CES).

Because of historical factors in hardware platform and operating system design, encoding forms with multi-byte code units, such as UTF-16 and UTF-32, must specify the byte order (Byte Order, or Endianness), using a character originally called ZERO WIDTH NO-BREAK SPACE (Unicode character number 0xFEFF), so that computers can process, store, and transfer the data correctly in big-endian or little-endian order. (What byte order, big-endian, and little-endian mean is explained in detail later.)

For UTF-8, whose code units are single bytes, there is no byte order problem and no byte order needs to be specified, so UTF-8 encoded bit and byte sequences are identical across all computer systems. (Why UTF-8 has no byte order problem is also explained later.)
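
The following Python sketch (my addition) shows the same code unit serialized in the two byte orders, together with the 0xFEFF byte order mark; the exact bytes shown assume the character 汉 (U+6C49):

import codecs

ch = '汉'                                   # code point U+6C49
print(ch.encode('utf-16-be').hex())         # 6c49: big-endian byte sequence
print(ch.encode('utf-16-le').hex())         # 496c: little-endian byte sequence
print(ch.encode('utf-16').hex())            # BOM + native order, e.g. fffe496c on a little-endian machine
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())   # feff fffe
print(ch.encode('utf-8').hex())             # e6b189: identical on every machine, no BOM needed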

Note that both CEF and CES contain the word "encoding", so the verb "encode" may refer either to converting the character numbers of the coded character set CCS into code unit sequences via the character encoding form CEF, or to converting the code unit sequences of CEF into byte sequences via the character encoding scheme CES.

The verb "decode" goes the other way, again with the same two possibilities.

Likewise, the noun "encoding" may refer to code unit sequences or to byte sequences. (In many articles the noun "encoding" is in fact used for character numbers rather than for code unit or byte sequences; from the viewpoint of the modern character encoding model this is a misuse, or at least imprecise, whatever its historical reasons.)

Therefore, it must be understood in context.

For programmers, the code unit sequence produced by the character encoding form CEF is more of a logical, intermediate encoding (an encoding intermediary: the middle state between character numbers and byte sequences, mediating the conversion of character numbers into byte sequences), not something one normally deals with directly.

The byte sequence produced by further encoding the code unit sequence through the character encoding scheme CES is the final, physical encoding that one deals with most directly in everyday work. (Strictly speaking there is also the encoding produced by the level-5 transfer encoding syntax TES, but that is only used in certain special transmission environments, so one deals with it less often.)

5 Transfer encoding syntax TES

For historical reasons, in some special transmission environments the byte sequences (byte streams) produced by the character encoding scheme CES of the level above need to be further adapted by an additional encoding. There are generally two kinds (a small sketch follows the list below):

  1. One kind maps byte sequences onto a more restricted set of values to meet the constraints of the transmission environment, for example the Base64 encoding or the quoted-printable encoding used for Email transmission, both of which map 8-bit bytes onto 7-bit data.

    Note: the Email protocols were designed to transmit only 7-bit ASCII characters. Viewed as an 8-bit byte, an ASCII character always has 0 as its first bit, so excluding that leading 0 there are 7 significant bits. Perhaps the Email protocols were originally limited to 7-bit ASCII to save bandwidth.

  2. The other kind compresses the values of the byte sequence, for example with lossless compression techniques such as LZW or run-length encoding.
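
As an illustration (a Python sketch using the standard base64 and quopri modules, added by me), the UTF-8 byte stream of a non-ASCII string can be wrapped into printable, 7-bit-safe characters:

import base64, quopri

data = '汉字'.encode('utf-8')                           # the CES-level byte stream
print(base64.b64encode(data))                           # b'5rGJ5a2X': Base64, printable ASCII only
print(quopri.encodestring(data))                        # b'=E6=B1=89=E5=AD=97': quoted-printable
print(base64.b64decode(b'5rGJ5a2X').decode('utf-8'))    # back to '汉字'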

summary

To summarize the modern character encoding model:

For a modern character encoding system such as Unicode, the same character may have several different code unit sequences, because there are several different character encoding forms CEF (UTF-8, UTF-16, UTF-32, and so on); and the same code unit sequence may have two different byte sequences, because there are two different character encoding schemes CES (big-endian and little-endian).

Nevertheless, as long as they represent the same character, these different code unit sequences and byte sequences are equivalent in meaning. (In the Unicode standard a few characters may correspond to more than one code point, for compatibility with other standards.)
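
To tie the five levels together, here is a Python sketch of the whole pipeline for a single character (my addition; the Base64 step stands in for the TES level and is purely illustrative):

import base64

ch = '汉'

# CCS: the character's number (code point) in the Unicode coded character set
code_point = ord(ch)                      # 0x6C49

# CEF: different encoding forms give different code unit sequences
utf8_units = ch.encode('utf-8')           # three 8-bit code units: e6 b1 89
utf16_be   = ch.encode('utf-16-be')       # one 16-bit code unit, big-endian bytes: 6c 49
utf16_le   = ch.encode('utf-16-le')       # CES: the same code unit, little-endian bytes: 49 6c
utf32_be   = ch.encode('utf-32-be')       # one 32-bit code unit: 00 00 6c 49

# TES: adapt the byte stream for a restricted channel such as Email
transfer = base64.b64encode(utf8_units)   # b'5rGJ'

print(hex(code_point), utf8_units.hex(), utf16_be.hex(), utf16_le.hex(), utf32_be.hex(), transfer)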

1.2 Origin of character encoding

The computer was originally an American invention. Computers were first invented to solve numerical problems, but later people found that computers could do more things, such as text processing.

But a computer only "knows" binary digits made up of 0s and 1s, such as 010110111000..., because the computer's underlying hardware uses the open and closed states of circuits to represent the two digits 0 and 1. A computer can therefore directly store and process only binary numbers.

In order to represent, store, and process characters like words, symbols, and so on on a computer, it is necessary to convert these characters into binary digits.

Of course, the conversion cannot be done however each of us pleases; otherwise the same binary data would display as different characters on different computers. A unified standard for the conversion had to be set.

Standards for this conversion, character encoding standards, were designed for exactly this purpose.

The evolution of the story

The next part is a quick preview of how character encodings evolved; if you are not interested, skip ahead to the following section:

As a complex machine that can compute, store, and communicate, a computer must at the very least be able to read what humans tell it to do. So a language had to be built that both computers and people could understand, in order to communicate.

Binary and Hex

What language does a computer use? A very simple one, with only the two binary digits 0 and 1 (because a computer uses high and low voltage levels to represent 1 and 0 respectively): 0 means no and 1 means yes. Through combinations of 0s and 1s, and (bitwise) operations between them, a computer can understand the world, analyze it, and help people do their work.

But 0 and 1 are so simple that even a small number takes a long string of 0s and 1s to write. For example, to make a computer remember the number 1000 (decimal), it must store the long string 1111101000 (binary). That is easy for the computer to remember, but not for humans... Is there a way to make the data a computer represents shorter and easier to remember?

Humans are used to decimal (Dec); after all, we have ten fingers, and counting in tens is convenient. But making computers use decimal is not realistic, because converting between decimal and binary is too troublesome (for the conversion methods, see the relevant references).

So, as a compromise, both octal and hexadecimal are acceptable. Octal uses only the digits 0 to 7. Hexadecimal (Hex) uses 0 to 9 for the first ten digits, followed by A, B, C, D, E, and F (case-insensitive). Of the two, hexadecimal is by far the more commonly used.

Hex coding principle

Hexadecimal works by splitting a long string of binary digits into groups of four, padding with zeros at the front if the digits do not divide evenly. A four-digit binary number has only 16 possible values, which are represented by 0-9 and A-F (case-insensitive). The encoding table looks like this:

Encoding (binary)  Character (hex)    Encoding (binary)  Character (hex)
0000               0                  1000               8
0001               1                  1001               9
0010               2                  1010               A
0011               3                  1011               B
0100               4                  1100               C
0101               5                  1101               D
0110               6                  1110               E
0111               7                  1111               F
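
For instance (a small Python sketch, my addition), the binary number 1111101000 mentioned earlier splits into the groups 11 1110 1000, which pad out to 0011 1110 1000 and read off as the hexadecimal 3E8:

n = 0b1111101000             # decimal 1000
print(hex(n))                # 0x3e8
print(format(n, '012b'))     # 001111101000: padded so it splits into groups of four
print(int('3E8', 16))        # 1000: reading the hex back as decimal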

More readable ASCII encoding

Hex is good, but there is still a problem: open a file and you get an eyeful of hexadecimal numbers, which is hard work to read. Hexadecimal is still not well suited to text. Could a method be devised that represents all the English characters and symbols typed on a keyboard, with characters that cannot be typed (such as carriage return and placeholders) represented by special symbols? Then opening a file would show plain English, which would be much more pleasant.

The United States, as the birthplace of the computer, naturally took the lead in producing such a standard code table: the American Standard Code for Information Interchange, or ASCII. This table covers digits, upper- and lower-case English letters, symbols, and various escape/control characters, and contains everything needed for English. ASCII soon became an international standard, and virtually every encoding in use today is compatible with ASCII.

What about other languages? Unicode and UTF-8

The English-speaking world was happy with the ASCII table... but other countries had headaches. What about accented letters? What about Japanese and Korean? What about Chinese, with its vast and intricate writing system? So countries developed encodings for their own languages (the ANSI encodings described in more detail later, collectively called ASCII supersets, extended ASCII, and so on). For general use in computer systems, however, these encodings generally remain compatible with ASCII.

The best known today is UTF-8 (an implementation of the Unicode character set). In Chinese it is sometimes nicknamed wanguo code ("the code of all nations"); as the name implies, it supports Simplified Chinese, Traditional Chinese, Japanese, Korean, and many other languages.

What happens when you open one code with another code?

Now that every country had its own code table, a problem arose. In an internationalized world, what happens if I use an encoding that supports my own language to open text produced with a different encoding? You get garbled text (mojibake)... What is more, with the advent of the Internet, computers in different countries needed to communicate, and one way to communicate is through URLs. Every country wanted addresses written in its own language, but that would make the addresses inaccessible from other countries, because those characters cannot even be typed there. So an intermediate encoding was urgently needed, one compatible with ASCII that could convert any of these forms into something representable with readable characters only (the printable characters of ASCII; more on that later). One such encoding is Base64.

The quick story of how character encodings evolved ends here; back to the original text.

EBCDIC code

The earliest such character encoding standard was EBCDIC. EBCDIC is short for Extended Binary Coded Decimal Interchange Code.

EBCDIC was developed and designed by International Business Machines Corporation (IBM) for its mainframe operating systems and was introduced in 1963-1964.

In EBCDIC the English letters are not arranged consecutively; there are several gaps in between, which caused some difficulties for programmers.

As a result, EBCDIC was not used in IBM's personal computer and workstation operating systems. Instead they used ASCII, an encoding scheme introduced later than EBCDIC that went on to become the industry standard for encoding English characters.

ASCII

ASCII (American Standard Code for Information Interchange) was published by the American National Standards Institute (ANSI) in 1963 and formally finalized in 1968; it is also known as US-ASCII or basic ASCII.

ASCII was then adopted by ISO/IEC in 1972 and issued as the ISO/IEC 646 standard (ISO, the International Organization for Standardization, was founded in 1946; IEC, the International Electrotechnical Commission, in 1906; ISO/IEC is commonly used to refer to standards developed jointly by the two organizations). Thus ISO/IEC 646 (often simply ISO 646) and ASCII refer to essentially the same coding standard.

Ps: ASCII is not completely equivalent to ISO 646; it is only that most national variants are compatible with ASCII, while some are not. ISO 646 is a 7-bit character set derived from several national standards. Apart from the letters and digits, which are the same in every country, the other positions of ISO 646 may be modified by each country to suit its own script. And because 8-bit character sets had not yet been widely accepted (specifically, before ISO 2022), countries placed different letters or symbols in those positions, so some letters or symbols that appear in ASCII do not appear in some countries' ISO 646 variants. – Wikipedia

Since ASCII appeared later than EBCDIC (some articles online say ASCII was designed before EBCDIC, but ASCII was not formally adopted as a standard until 1968), ASCII's design drew on EBCDIC's experience: the English letters are arranged consecutively, which makes them easier for programs to handle.

Ps: ASCII was first published in 1963 (meaning work on it began before 1963); it simply had not yet been finalized as a standard. EBCDIC was developed in 1963-1964. So personally I think ASCII is the older of the two.

ASCII encoding scheme is not the earliest character encoding scheme, but it is the most basic, the most important, the most widely used character encoding scheme.

Other current character encoding schemes, such as ISO 8859 series, GB series (GB2312, GBK, GB18030, GB13000), Big5, Unicode and so on, are directly or indirectly compatible with ASCII code.

And coding schemes that are completely incompatible with ASCII, such as EBCDIC, are basically obsolete or about to become obsolete.

ASCII character encoding scheme introduction

ASCII uses seven binary digits (bits) to represent one character, for a total of 128 characters (2^7 = 128; binary 0000 0000 to 0111 1111 when written as a byte, decimal 0 to 127).

Since computers now generally access and process data in 8-bit bytes, the remaining highest bit (the 8th bit) is generally set to 0, though in some communication systems it is also used as a parity bit.

Ps: when ASCII was first announced in 1963, communication protocols used the 8th (highest) bit for error checking and correction. For computer memory, however, error checking was unnecessary and the bit was little used (and did not represent actual characters). Thus 8-bit character encodings gradually emerged to represent more characters than ASCII (that is, ASCII supersets). – from the introduction of the Wikipedia article on ISO/IEC 2022.

The ASCII character set contains 128 characters (see the ASCII table). Their code point numbers (character numbers, given in decimal Dec or hexadecimal Hex) range from 0 to 127 (binary 0000 0000 to 0111 1111, hexadecimal 0x00 to 0x7F); the highest binary bit is always 0. Among them (a short sketch follows the list):

  • Code points Dec 0 to 31: non-displayable, non-printable control characters or communication-specific characters, for example 0x07 (BEL, bell), which makes the computer beep; 0x00 (NUL, null, note: not the space character), commonly used to mark the end of a string; 0x0D (CR, carriage return) and 0x0A (LF, line feed), which tell a printer's print head to return to the start of the line (carriage return) and move down to the next line (line feed); and so on.

    Note: it may seem strange to call these control or communication-specific characters "characters", but what these so-called "characters" actually represent is an action or behavior, which is why they cannot be displayed or printed.

  • Code point Dec 32: the space character, which can be displayed but not printed;

  • Code points Dec 33 to 126: displayable, printable characters, including the Arabic digits 0 to 9 at 48 to 57, the 26 uppercase letters at 65 to 90, the 26 lowercase letters at 97 to 122, plus punctuation marks and arithmetic symbols;

  • Code point Dec 127: the non-displayable, non-printable control character DEL.
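
A quick Python sketch (my addition) of those ranges and of a few control characters:

print(ord('0'), ord('9'))    # 48 57: the digits
print(ord('A'), ord('Z'))    # 65 90: the uppercase letters
print(ord('a'), ord('z'))    # 97 122: the lowercase letters
print(repr(chr(7)), repr(chr(13)), repr(chr(10)), repr(chr(127)))   # '\x07' '\r' '\n' '\x7f'
print('hi\x07'.encode('ascii'))   # b'hi\x07': a control character occupies one byte like any other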

Encoding characters at this stage is very simple: to store a character sequence as a binary stream on a storage device, you only need to take each character's number in the ASCII character set (its code point number) and write it directly to the storage device as one byte. The character number is the character encoding; there is no special encoding algorithm for converting a character number into a character code, let alone any conversion from a code unit sequence to a byte sequence.

ASCII thus maps combinations of binary 0s and 1s directly to code point numbers; it is the most direct mapping between binary and characters, with no conversion involved. Of course, the same values can also be written in decimal (Dec) or hexadecimal (Hex); after all, decimal and hexadecimal are more intuitive for humans, while binary suits the machine.

The problem of ASCII

For English, 128 symbols are enough to encode everything, but for other languages 128 symbols are not enough. In French, for example, letters with accents above them cannot be represented in ASCII. So some European countries decided to use the unused highest bit of the byte to encode new symbols: for example, in such encodings é in French is 130 (binary 1000 0010). As a result, the encoding systems used in these European countries could represent up to 256 symbols.

But this created a new problem. Different countries have different letters, so even though they all used 256-symbol encodings, the same value did not represent the same letter. For example, 130 stands for é in French encodings, for the letter Gimel (ג) in Hebrew encodings, and for yet another symbol in Russian encodings. In all of these encodings, however, the symbols from 0 to 127 are the same; only the range 128-255 differs.

As for Asian scripts, they use even more symbols; there are roughly 100,000 Chinese characters alone. A single byte can represent only 256 symbols, which is certainly not enough, so more than one byte must be used per symbol. For example, GB2312, the common encoding for Simplified Chinese, uses two bytes per Chinese character, so in principle it can represent up to 256 x 256 = 65536 symbols.
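
A small Python sketch (my addition; GB2312 itself is the topic of part II of the series) shows the one-byte versus two-byte difference:

print('A'.encode('gb2312'))          # b'A': ASCII-compatible, one byte
print('汉'.encode('gb2312'))         # b'\xba\xba': one Chinese character, two bytes
print(len('汉'.encode('gb2312')))    # 2
print(b'\xba\xba'.decode('gb2312'))  # 汉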

Is ASCII 7 bits or 8 bits?

ASCII as originally used in the United States was a 7-bit, 128-character code, which can be called basic ASCII. The eighth (highest) bit was not used to represent part of a character; it was used for error checking, but not much, and for computer memory it was unnecessary. So many 8-bit character encodings emerged (using the 8th bit in practice to represent the characters of other countries, mainly Western European ones), that is, 8-bit, 256-character codes, which can be called ASCII supersets, 8-bit versions, extended ASCII, and so on. They used the code points with the 8th bit set for the characters of their own national scripts, while the lower 7 bits stayed compatible with ASCII. Within the ASCII range the 8th bit is uniformly set to 0, so that a computer system can process, store, and transmit each character in exactly one byte.

In short: ASCII is sometimes called a 7-bit code, but in practice the 8th bit is used too, set to 0, so that characters can be processed, stored, and transmitted one per byte.

Attachment 1: ASCII Art

ASCII art, also known as "text graphics" or "character painting", is a computer-based art form that uses computer characters (mainly ASCII) to compose pictures. It first developed at Carnegie Mellon University in 1982. In the early days of the Internet it appeared frequently on English-language networks (Usenet, BITNET, Internet forums, FidoNet, bulletin board systems). It can be produced with a text editor, and much ASCII art requires a fixed-width font (a font whose characters all have the same width, such as those on traditional typewriters) to display correctly.

ASCII art is used where text can be transmitted more reliably and quickly than images, including typewriters, teleprinters, terminals without graphics, early computer networks, E-mail, and Usenet news messages.

Ps: nowadays it mainly appears in forum signatures, chats, and input methods (the Sogou input method calls them emoticons).

ASCII art can be placed in an HTML document, but it is usually wrapped in a <pre> </pre> preformatted-text tag so that it is displayed in a monospaced font. Alternatively, ASCII art can be generated in HTML using CSS.

There is also the ASCII stereogram, a form of ASCII art that produces the illusion of a 3-D image when the viewer crosses their lines of sight appropriately while looking at it. Try it out for yourself.

Attachment 2: ASCII Ribbon Operation

The ASCII Ribbon campaign was an Internet campaign, started in 1998 and ended in June 2013, that promoted plain-text email.

The rationale for the ASCII Ribbon campaign is as follows:

  • Some email clients do not support HTML email. This means the recipient may not be able to read the message at all, because it will be displayed as raw HTML code or even as an error message.
  • Other clients have very poor or broken HTML rendering that makes messages hard to read.
  • HTML email takes up a lot of space and is very inefficient. Even an HTML email that contains no graphics is larger than a plain-text email. Most Internet users have a quota on their email accounts and must pay extra if they exceed it.
  • Background images and fancy graphics in HTML email (such as those Incredimail provides) are considered by the campaign's supporters to be a waste of outbox and inbox space. Having to download 200 KB or more of email for a few lines of text is fairly ludicrous; with plain text, exactly the same information fits in a message a fraction (even one hundredth) of the size.

Ps: HTML mail is the kind of gorgeous advertising (junk) mail your mailbox often receives. (Personal opinion only.)

/o\
// \\ The ASCII
\\ // Ribbon Campaign
 \V/  Against HTML
 /A\  eMail!
// \\


()  ascii ribbon campaign - against html e-mail 
/\  www.asciiribbon.org   - against proprietary attachments


                       _
ASCII ribbon campaign ( )
 against HTML e-mail   X
                      / \


EASCII code

After the computer appeared, its use gradually spread from the United States to Europe. The characters used in many European countries include, besides the 128 basic ASCII characters used in the United States, a large number of derived characters such as accented Latin letters. French, for example, has letters with diacritical marks above them; other European countries likewise have characters of their own.

Considering that a byte can represent 256 encodings (2^8 = 256), ASCII characters use only the lower 7 bits of a byte (thus the highest bit in ASCII is always 0) and are numbered 0x00 to 0x7F (0 to 127 in decimal).

That is, ASCII uses only the first 128 (2^7 = 128) of the 256 codes that a single byte can represent; the remaining 128 codes are essentially idle. As a result, European countries set their sights on those latter 128 codes.

The problem is that every country in Europe had the same idea, so different countries assigned different characters to the 128 codes 0x80~0xFF (128~255 in decimal notation).

Ps: A similar situation exists with the national versions of ISO 646, where a few countries even replaced characters inside the basic ASCII range and broke compatibility with it. So ASCII is not the same thing as ISO 646.

In order to put an end to this chaos among European countries, two unified single-byte encoding schemes were devised that support both ASCII and the derived characters used in European countries: one is the EASCII (Extended ASCII) character encoding scheme, and the other is the ISO/IEC 8859 character encoding scheme.

Ps: EASCII appeared in 1964. For the ISO 8859 series, ANSI began the work in collaboration with ECMA in 1982; ECMA-94 was published in 1985 and was later adopted for standardization as ISO/IEC 8859 parts 1 to 4. The remaining parts were published at various dates between 1987 and 2001, with part 12 officially abandoned (in 1997).

Let’s start with EASCII. EASCII likewise uses the idle highest bit of ASCII (i.e., the first bit of the byte) to encode new characters; for these new, non-ASCII characters the highest bit is always 1. In other words, all eight bits of the byte are used to represent a character. For example, the French letter é is encoded as 130 (binary 1000 0010).

Obviously, EASCII codes use the same single-byte encoding as ASCII codes, but can represent up to 256 characters (2^8 = 256), twice as many as ASCII’s 128 characters (2^7=128).

Therefore, in EASCII, when the first bit (the highest bit of the byte) is 0, the code still represents an ordinary ASCII character (binary 0000 0000 ~ 0111 1111, decimal 0~127); when it is 1, the code represents one of the supplementary derived characters (binary 1000 0000 ~ 1111 1111, decimal 128~255).

In this way, ASCII compatibility is preserved while new characters are added as an extension on top of it, hence the name Extended ASCII, or EASCII for short.
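A small Python sketch of this idea, using code page 437 (the original IBM PC character set) as one concrete extended-ASCII variant; CP437 is chosen here purely for illustration:

```python
# Bytes 0x00-0x7F decode to plain ASCII under an extended-ASCII code page...
assert bytes([0x41]).decode("cp437") == "A"     # highest bit 0 -> ordinary ASCII

# ...while bytes with the highest bit set map to the extra characters.
ch = bytes([130]).decode("cp437")               # 130 = 0b1000_0010
print(ch)                                       # 'é' in CP437
```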

The symbols that EASCII adds beyond ASCII include box-drawing symbols, calculation symbols, Greek letters and special Latin symbols, as shown in the table below.

However, EASCII is now rarely used; the ISO/IEC 8859 character encoding schemes are far more common.

The ISO 8859 series

The ISO 8859 series of schemes is similar to EASCII: on top of ASCII it also uses the highest bit (the first bit), which the 7-bit ASCII code leaves unused, extending the coding range from the original 0x00~0x7F of ASCII (0~127 in decimal) by adding 0x80~0xFF (128~255 in decimal).

In 1982, ANSI began this work in cooperation with ECMA. In 1985, ECMA-94 was published and adopted as ISO/IEC 8859 parts 1, 2, 3, 4 in 1988.

It should be noted that of the 128 codes added by the ISO 8859 character encoding schemes, only 0xA0~0xFF (160~255 in decimal) are actually used. That is, only these 96 codes have characters defined; the 32 codes 0x80~0x9F (128~159 in decimal) do not.
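As a hedged Python sketch of this point (note that Python's ISO 8859-1 codec follows the IANA charset convention and maps 0x80~0x9F to C1 control code points rather than to printable characters):

```python
import unicodedata

# 0x80-0x9F carry no printable characters in the ISO 8859 schemes;
# Python maps them to C1 control code points, Unicode category 'Cc'.
print(unicodedata.category(bytes([0x85]).decode("iso8859-1")))   # 'Cc' (a control code)
print(unicodedata.category(bytes([0xE5]).decode("iso8859-1")))   # 'Ll' ('å', a letter)
```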

Obviously, the ISO/IEC 8859 (ISO 8859) character encoding scheme is also a single-byte encoding scheme and is also fully ASCII compatible.

ISO/IEC 8859 is a set of 15 character sets, namely ISO/IEC 8859-n with n = 1, 2, 3, …, 15, 16 (part 12 is undefined, so there are 15 in total). Ps: See Wikipedia – click the link above to jump there.

These 15 character sets cover roughly all the characters used in European countries (and even a few from outside Europe), and each supplementary extension (that is, the part beyond the ASCII-compatible characters) actually uses only the 96 codes 0xA0 to 0xFF (160 to 255 in decimal).

Ps: This means that, apart from the shared basic-ASCII part, the 96 extended codes of the different parts are mutually incompatible.
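A brief Python sketch of this incompatibility, using Python's standard codec aliases for three ISO 8859 parts: the very same byte decodes to different characters depending on which part you assume.

```python
# One and the same byte, interpreted under three different ISO 8859 parts.
b = bytes([0xE4])

print(b.decode("iso8859-1"))   # 'ä'  (Latin-1, Western European)
print(b.decode("iso8859-5"))   # 'ф'  (Cyrillic)
print(b.decode("iso8859-7"))   # 'δ'  (Greek)

# The ASCII range, by contrast, is identical in every part.
assert b"ASCII".decode("iso8859-1") == b"ASCII".decode("iso8859-7")
```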

Each ISO 8859-n character set is a specific character set assembled from several components of ISO 2022. These components are:

  • Low-end control characters (C0);
  • The ASCII character set (GL);
  • High-end control characters (C1);
  • High-end characters (GR), which are specific to each ISO 8859-n variant. For example, ISO 8859-1 consists of ISO-IR-1, ISO-IR-6, ISO-IR-77, and ISO-IR-100.

Concepts such as C0, GL, C1, GR, etc. are described in detail in wikipedia of ISO 2022 and ISO 8859. If you want to learn more, follow the link.

Among them, ISO/IEC 8859-1 covers the characters commonly used in Western Europe (including German and French letters) and is currently the most widely used. ISO/IEC 8859-1 is often abbreviated to ISO 8859-1 and is also known as Latin-1.

Note: the “Codepage 819” in front of the image title indicates that the ISO 8859-1 Codepage number is 819. More on the “Codepage” later.

The remaining characters in ISO 8859-2 through ISO 8859-16 are as follows:

  • The ISO 8859-2 character set, also known as Latin-2, includes eastern European characters;
  • The ISO 8859-3 character set, also known as Latin-3, contains southern European characters;
  • The ISO 8859-4 character set, also known as Latin-4, contains Nordic characters;
  • The ISO 8859-5 character set, also known as Cyrillic, contains the Slavic characters;
  • The ISO 8859-6 character set, also known as Arabic, contains the Arabic language characters;
  • The ISO 8859-7 character set, also known as Greek, contains the Greek characters;
  • The ISO 8859-8 character set, also known as Hebrew, contains Hebrew characters;
  • ISO 8859-9 character set, also known as Latin-5 or Turkish, contains Turkish characters;
  • The ISO 8859-10 character set, also known as Latin-6 or Nordic, contains characters from Northern Europe (mainly Scandinavia);
  • The ISO 8859-11 character set, also known as Thai, is almost identical to the Thai National standard TIS-620 (1990) character set, with the only difference being that, ISO 8859-11 defines the non-breaking space character NBSP (code point value 0xA0), which is not defined in TIS-620;
  • ISO 8859-12 character set, not yet defined (there are two reasons for this. First, it was originally designed as a “Latin-7” containing the Celtic character set, but later the Celtic family became ISO 8859-14 / Latin-8; Another was reserved for Indian Sanskrit, but later shelved);
  • The ISO 8859-13 character set, also known as Latin-7, mainly covers the Baltic-language characters and supplies some Latvian characters missing from Latin-6;
  • ISO 8859-14 character set, also known as Latin-8, which replaces certain symbols in Latin-1 with Celtic characters;
  • The ISO 8859-15 character set, also known as Latin-9 or Latin-0, removes some little-used symbols from Latin-1 and replaces them with missing French and Finnish letters; it also replaces the generic currency sign (the one that sits between the pound and yen signs) with the euro sign, the European Union currency symbol (see the short sketch after this list);
  • The ISO 8859-16 character set, also known as Latin-10, covers Albanian, Croatian, Hungarian, Italian, Polish, Romanian, Slovenian and other southeast European languages.
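To make the Latin-9 point above concrete, here is a brief Python sketch (the codec names are Python's standard aliases for ISO 8859-15 and ISO 8859-1):

```python
# The euro sign exists in Latin-9 (ISO 8859-15) but not in Latin-1 (ISO 8859-1).
euro = "€"

print(euro.encode("iso8859-15"))        # b'\xa4' - the slot of the old currency sign
try:
    euro.encode("iso8859-1")
except UnicodeEncodeError as err:
    print("Latin-1 cannot encode the euro sign:", err)
```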

The ISO 2022 series

Between the publication of the EASCII code and of the ISO 8859 series introduced above (roughly 1964 to 1988), another important character encoding appeared: ISO 2022.

As 8-bit character encodings emerged to represent more characters than ASCII, the ECMA-35 standard, published in 1971, set out common rules that the various 7-bit and 8-bit character encodings should follow. ECMA-35 was subsequently adopted as ISO 2022. It is particularly well known for its encoding methods for the East Asian languages: Chinese, Japanese and Korean.

English can be stored with a 7-bit code, and other languages that use the Latin, Greek, Cyrillic or Hebrew alphabets have only a few dozen letters, so they are traditionally represented with the 8-bit codes of the ISO/IEC 8859 standards. Chinese, Japanese and Korean, however, have far too many characters to express with a single 8-bit unit, so more than one byte is needed per character. ISO 2022 was therefore designed so that Chinese, Japanese and Korean can be represented by sequences of several 7-bit code units.

ISO 2022 is used to:

  • Representing characters belonging to more than one character set within a single character encoding;
  • Representing large character sets;
  • Remaining compatible with 7-bit channels, even when using 8-bit coded character sets.

ISO 2022 uses “escape sequences” to indicate which character set the characters that follow belong to. These character sets are registered with ISO and follow the pattern laid down by the ISO 2022 standard. An escape sequence consists of the “ESC” character (0x1B) followed by two or three more characters, and it signals that the characters after it belong to the character set it names. If the context already makes clear which character set is in use, a character set may also be designated without an escape sequence; in fact, ISO 8859-1 declares that it does not need to define an escape sequence. For details, click the ISO-8859-1 link.
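As a hedged illustration, Python ships an ISO-2022-JP codec (Japanese is used here simply because the codec is readily available), and the escape sequences it emits are easy to inspect:

```python
# ISO-2022-JP switches character sets with ESC (0x1B) sequences:
# ESC $ B selects JIS X 0208 (two 7-bit code units per character),
# ESC ( B switches back to ASCII.
text = "AB漢字CD"
data = text.encode("iso2022_jp")

print(data)                               # ESC sequences are visible as \x1b...
# Every byte stays within the 7-bit range, exactly as ISO 2022 intends.
assert all(b <= 0x7F for b in data)
```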

ISO 2022 is equivalent to ECMA-35 of the European standards organization ECMA. China's GB 2312, the Japanese industrial standard JIS X 0202 (formerly JIS C 6228) and the Korean industrial standard KS X 1004 (formerly KS C 5620) all comply with the ISO 2022 specification.

Character Coding (ii: Simplified Chinese Character Coding and ANSI Coding)