The history of the byte

Inside a computer, all information is ultimately stored as binary values

A bit is the smallest unit of data storage inside a computer. Each bit has exactly two states: 0 and 1

A byte is a unit of binary data, usually 8 bits long: 1 B (byte) = 8 bits, so one byte can be combined into 256 distinct states!

Question 1: How do we arrive at those 256 states?
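One way to see it, sketched in code: each additional bit doubles the number of states, so 8 bits give 2^8 = 256 combinations.

```javascript
// Each additional bit doubles the number of representable states
let states = 1;
for (let i = 0; i < 8; i++) {
  states *= 2; // one more bit, twice as many combinations
}
console.log(states); // 256
console.log(2 ** 8); // 256, the same value written as a power
```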

Unit conversion

  1. 1 byte = 8 bits
  2. 1 byte = 1 B
  3. 1 KB = 1024 B
  4. 1 MB = 1024 KB
  5. 1 GB = 1024 MB
  6. 1 TB = 1024 GB
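The conversions above can be chained in code. A minimal sketch (the function name `formatBytes` is my own):

```javascript
// Convert a raw byte count into the largest sensible 1024-based unit
function formatBytes(bytes) {
  const units = ['B', 'KB', 'MB', 'GB', 'TB'];
  let i = 0;
  while (bytes >= 1024 && i < units.length - 1) {
    bytes /= 1024; // each step up the list divides by 1024
    i++;
  }
  return bytes + ' ' + units[i];
}
console.log(formatBytes(1048576)); // "1 MB" (1024 * 1024 bytes)
console.log(formatBytes(500));     // "500 B"
```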

Number bases

For any base X, the digit in each position carries over to the next position when it reaches X: decimal carries at ten, hexadecimal at sixteen, binary at two, and in general base X carries at X.

Now look at the representations of the decimal number 25 in various bases:

```javascript
let a = 0b11001; // 0b prefix: binary
console.log(a);  // 25
let b = 0o31;    // 0o prefix: octal
let c = 25;      // decimal
let d = 0x19;    // 0x prefix: hexadecimal
```

Base conversion

`toString` and `parseInt` are two built-in methods in JS:

`toString` converts a number to a string in a specified base; `parseInt` parses a string in a specified base into a (decimal) number.

1. Convert any base to decimal;

```javascript
console.log(parseInt('11001', 2)); // 25
console.log(parseInt('007F', 16)); // 127
```

2. Decimal to arbitrary base;

```javascript
(25).toString(2); // '11001'
```

Question 2: How do we convert from an arbitrary base to another arbitrary base?
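One answer, sketched in code: go through decimal as an intermediate step by combining the two built-in methods (the helper name `convert` is my own):

```javascript
// Arbitrary base -> arbitrary base, via decimal as the intermediate step
function convert(str, fromBase, toBase) {
  return parseInt(str, fromBase).toString(toBase);
}
console.log(convert('11001', 2, 16)); // '19' (binary 11001 = decimal 25 = hex 19)
console.log(convert('ff', 16, 2));    // '11111111'
```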

A byte is a unit that computers can work with, but people are not used to reading raw bytes, so character encodings came into being!

ASCII

At first, computers were used only in the United States. 128 symbols were enough to represent everything needed for English, and these 128 symbols occupied only the last seven bits of a byte, with the first bit uniformly set to zero.



Common ASCII ordering rules:

1) Digits are smaller than letters, e.g. "7" < "F";
2) The digit 0 is smaller than the digit 9, and digits increase in order from 0 to 9, e.g. "3" < "8";
3) The letter A is smaller than the letter Z, and letters increase from A to Z, e.g. "A" < "Z";
4) An uppercase letter is 32 smaller than the corresponding lowercase letter, e.g. "A" < "a".
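The rules above can be checked directly with `charCodeAt`:

```javascript
console.log('7'.charCodeAt(0)); // 55
console.log('F'.charCodeAt(0)); // 70, so '7' < 'F'
console.log('A'.charCodeAt(0)); // 65
console.log('a'.charCodeAt(0)); // 97, which is exactly 65 + 32
console.log('a'.charCodeAt(0) - 'A'.charCodeAt(0)); // 32
```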

The ASCII character set is still in use today, but its biggest drawback is that it can only represent the basic Latin alphabet, Arabic numerals and English punctuation marks, so each country created its own character encoding rules, for example GB2312 in China.

GB2312

  • Bytes with values less than 127 keep the same meaning as in the original ASCII set
  • Two consecutive bytes with values greater than 127 together represent one simplified Chinese character
  • The first byte (high byte) runs from 0xA1 to 0xF7 and the second byte (low byte) from 0xA1 to 0xFE, yielding about 7,000 simplified Chinese characters
  • ASCII digits, punctuation marks, and letters were also re-encoded as two-byte codes; these are called "full-width" (全角) characters

This is why one Chinese character is often said to count as two English characters. By checking whether the first byte is greater than 127, the computer knows whether to read two bytes together or one byte at a time.
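A minimal sketch of that detection rule, assuming the input is an array of GB2312 byte values (the function name is my own; a real decoder would also map each byte pair to its character):

```javascript
// Walk a GB2312 byte stream and count characters:
// a byte > 127 starts a two-byte Chinese character; otherwise it is one ASCII character
function countGb2312Chars(bytes) {
  let count = 0;
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] > 127) i++; // high byte: consume the following low byte as well
    count++;
  }
  return count;
}
// 'A' (0x41), one two-byte Chinese character (0xD6 0xD0), then 'B' (0x42)
console.log(countGb2312Chars([0x41, 0xd6, 0xd0, 0x42])); // 3
```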

Each country had its own unique language and character encoding, and Unicode was created to unify the encoding of all characters. Unicode consolidates all languages into one character set, so garbled-text problems disappear.

Unicode character set

Each character is assigned a number, which is called a code point.

Unicode Encoding list of Chinese characters

As can be seen from the table above, the hexadecimal representation of the Chinese character 丁 is 4E01.

```javascript
// Query a character's Unicode code point
console.log('丁'.charCodeAt(0)); // 19969
// Turn a code point back into a string
console.log(String.fromCharCode(19969)); // 丁
// Hexadecimal notation
console.log('丁'.charCodeAt(0).toString(16)); // 4e01
```

Unicode (in its original two-byte form) usually uses two bytes to represent a character; the original one-byte English codes become two bytes by padding the high byte with zeros.

From Unicode onward, both half-width English characters and full-width Chinese characters take two bytes.

  • A byte is an 8-bit physical storage unit
  • A character is a culture-dependent unit of meaning

Unicode is just a character set: it assigns each character a binary code point, but says nothing about how that code should be stored. For example, a common Chinese character's code point needs two bytes; how does the computer know that those two bytes represent one character rather than two separate one-byte characters? UTF was created to make Unicode practical to store and transmit.

UTF and Unicode

UTF (Unicode Transformation Format) is a family of formats for converting a Unicode code point into a concrete sequence of bytes. Common UTFs include UTF-8, UTF-16, and UTF-32.

UTF-8 (an encoding)

One of the biggest features of UTF-8 is that it is a variable-length encoding: it uses 1 to 4 bytes to represent a symbol, with the length depending on the symbol.
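The variable length is easy to observe with the standard `TextEncoder` API, which always produces UTF-8 bytes:

```javascript
// TextEncoder encodes strings to UTF-8 byte arrays
const enc = new TextEncoder();
console.log(enc.encode('A').length);  // 1 byte  (ASCII range)
console.log(enc.encode('é').length);  // 2 bytes (U+00E9)
console.log(enc.encode('丁').length); // 3 bytes (U+4E01)
console.log(enc.encode('😀').length); // 4 bytes (U+1F600, outside the BMP)
```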

UTF-8's encoding rules are simple; there are only two:

1) For single-byte symbols, the first bit is set to 0 and the following 7 bits are the symbol's Unicode code point. So for English letters, UTF-8 encoding is identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each of the following bytes are set to 10. All the remaining bits are filled with the symbol's Unicode code point.

The following table summarizes the encoding rules, with the letter x marking the available code bits.

```
Unicode range (hexadecimal) | UTF-8 encoding (binary)
----------------------------+----------------------------------------
0000 0000 - 0000 007F       | 0xxxxxxx
0000 0080 - 0000 07FF       | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF       | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF       | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
```

```javascript
// Unicode code points are given in hexadecimal; common Chinese characters take 3 UTF-8 bytes.
// Convert a code point in the 0x0800-0xFFFF range to its UTF-8 hex representation.
function transfer(num) {
  let arr = ["1110", "10", "10"];                // leading bits of the three bytes
  let str = num.toString(2);                     // e.g. 0x4e25 -> '100111000100101'
  let len = str.length;
  arr[2] += str.substring(len - 6);              // last 6 bits
  arr[1] += str.substring(len - 12, len - 6);    // middle 6 bits
  arr[0] += str.substring(0, len - 12).padStart(4, '0'); // remaining bits, zero-padded to 4
  return arr.map(item => parseInt(item, 2).toString(16)).join('');
}
let r = transfer(0x4e25);
console.log(r); // e4b8a5
```

Let's take the Chinese character 严 as an example to demonstrate how UTF-8 encoding works.

The Unicode code point of 严 is 4E25 (binary 100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, i.e. the format 1110xxxx 10xxxxxx 10xxxxxx. Starting from the last bit, the x positions are filled in from back to front, and any leftover positions are filled with zeros. The result: the UTF-8 encoding of 严 is 11100100 10111000 10100101, which in hexadecimal is E4B8A5.
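The reverse direction can be sketched with bit operations (the function name `decodeUtf8Triple` is my own): mask off the leading marker bits of each byte and concatenate the payload bits.

```javascript
// Decode a three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) back to its code point
function decodeUtf8Triple(b1, b2, b3) {
  return ((b1 & 0x0f) << 12) | // low 4 bits of the first byte
         ((b2 & 0x3f) << 6)  | // low 6 bits of the second byte
         (b3 & 0x3f);          // low 6 bits of the third byte
}
let cp = decodeUtf8Triple(0xe4, 0xb8, 0xa5);
console.log(cp.toString(16));         // '4e25'
console.log(String.fromCharCode(cp)); // '严'
```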

The array sort method

This article was originally motivated by investigating how sort, when called without arguments, orders elements by their code points!

```javascript
var arr = [1, 2, 'A', 'Z', 'a', 'z'];
arr.sort();
console.log(arr); // [1, 2, 'A', 'Z', 'a', 'z']

var a = [1, 2, 'A', 'AB', '长沙', '北京', 'a', 'ab', 'ac', 'z'];
a.sort();
console.log(a); // [1, 2, "A", "AB", "a", "ab", "ac", "z", "北京", "长沙"]
// Without arguments, sort converts elements to strings and compares them
// code point by code point: if the first characters are equal, the second
// characters are compared, so 'ac' comes after 'ab' and before 'z'.
'北京'.charCodeAt(0); // 21271
'长沙'.charCodeAt(0); // 38271 -- '长' has a larger code point than '北', so 长沙 sorts after 北京
```

The study of these sorting rules grew into this whole exploration of computer character encoding!

Reference articles

Unicode and JavaScript in detail

Character encoding notes: ASCII, Unicode and UTF-8