Basic concepts
First of all, understand two ideas: some things are meant to be read by machines, and some things are meant to be read by people.
Number bases (2, 8, 10, 16) are notations for the numeric values that machines work with directly.
Encodings (ASCII, Unicode, UTF-8) map the characters that people recognize at a glance to the binary representations that machines understand.
Basic knowledge
- Binary: 0-1
- Octal: 0-7
- Decimal: 0-9
- Hexadecimal: 0-9 and A-F
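To make the four digit sets concrete, here is a minimal sketch that writes one integer in each base with Python's built-in format(). The value 255 is an arbitrary choice for illustration, not something from the text above.

```python
# Write one integer in the four bases listed above.
value = 255  # arbitrary example value (an assumption for illustration)

binary = format(value, 'b')        # uses digits 0-1
octal = format(value, 'o')         # uses digits 0-7
decimal = format(value, 'd')       # uses digits 0-9
hexadecimal = format(value, 'X')   # uses digits 0-9 and A-F

print(binary, octal, decimal, hexadecimal)
```

The same quantity needs fewer digits as the base grows, which is why hexadecimal is the usual shorthand for binary data.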
Bits, bytes, characters, strings
- Bit: a single 0 or 1, the smallest unit of representation
- Byte: eight bits, such as 01010101; bytes can be combined to represent characters
- Character: a character takes at least 1 byte (8 bits). (Note: the same character takes a different number of bytes in different encodings.)
- String: a sequence of characters
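The four terms above can be seen side by side in a short sketch. The sample string is an assumption chosen for illustration: one ASCII letter plus one Chinese character, stored as UTF-8.

```python
# One string, viewed as characters, bytes, and bits.
text = 'A汗'                    # sample string (an assumption): 2 characters
data = text.encode('utf-8')     # the bytes that store it in UTF-8

char_count = len(text)          # counts characters
byte_count = len(data)          # counts bytes: 1 for 'A' plus 3 for '汗'
bits = [format(b, '08b') for b in data]  # each byte written as 8 bits

print(char_count, byte_count)
print(bits)
```

Note that len() on a str counts characters, while len() on bytes counts bytes, so the two numbers differ as soon as a character needs more than one byte.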
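Encodings
- ASCII: 1 character = 1 byte = 8 bits (only 7 of the bits are used, giving 128 characters).
- Unicode: a character set that gives every character a numeric code point. In the fixed-width 2-byte form (UCS-2), 1 character = 2 bytes = 16 bits; code points above U+FFFF need more.
- UTF-8: a variable-length encoding of Unicode. One character takes 1 to 4 bytes.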
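As a rough check on the sizes listed above, the sketch below counts how many bytes a few characters take in UTF-8, UTF-16, and UTF-32. The sample characters are assumptions picked to cover the 1, 2, 3, and 4 byte cases of UTF-8, and the "-be" codec variants are used so that no byte order mark is added to the count.

```python
# Count bytes per character under three Unicode encodings.
samples = ['a', 'é', '汗', '😀']   # sample characters (assumptions)
encodings = ('utf-8', 'utf-16-be', 'utf-32-be')  # "-be": no byte order mark

byte_counts = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, counts in byte_counts.items():
    # Show the code point in the usual U+XXXX notation next to the sizes.
    print('U+{:04X}'.format(ord(ch)), counts)
```

Only the last character lies above U+FFFF, and it is the only one here that no longer fits in two bytes of UTF-16.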
Relationships and conversions
1. Numbers
In Python, integers are converted between bases with int(), bin(), oct(), and hex().
>>> 10
10
>>> bin(10)
'0b1010'
>>> oct(10)
'0o12'
>>> hex(10)
'0xa'
>>> int('0b1010',base=2)
10
>>> int('0o12',base=8)
10
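Two related built-in features may be handy alongside the calls above: int() with base=0 infers the base from the literal's prefix, and format() controls padding and the prefix on output. This is a small sketch, with the sample literals chosen to match the session above.

```python
# Parse prefixed literals without naming the base, then format them back.
literals = ['0b1010', '0o12', '0xa', '10']
values = [int(text, 0) for text in literals]  # base=0: infer base from prefix

padded = format(10, '08b')    # binary, zero-padded to 8 digits, no prefix
prefixed = format(10, '#x')   # hexadecimal with the 0x prefix
bare = format(10, 'o')        # octal without the 0o prefix

print(values, padded, prefixed, bare)
```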
2. Characters
In Python, a character and its code point are converted with chr() and ord().
>>> ord('汗')
27735
>>> chr(27735)
'汗'
>>> bin(27735)
'0b110110001010111'
>>> ord('a')
97
>>> chr(97)
'a'
>>> bin(97)
'0b1100001'
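Code points are usually written in hexadecimal as U+XXXX rather than in decimal or binary. The helper below is a hypothetical convenience function (the name code_point_label is an assumption, not a built-in) that produces that notation from ord().

```python
def code_point_label(ch):
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the integer code point; format it as at least 4 hex digits.
    return 'U+{:04X}'.format(ord(ch))

label_han = code_point_label('汗')
label_a = code_point_label('a')
restored = chr(0x6C57)   # chr() accepts the same number written in hex

print(label_han, label_a, restored)
```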
3. Conversion between encodings
Use encode() and decode() (UTF-8 by default).
>>> tmp_str = '汗'
>>> tmp_str.encode()
b'\xe6\xb1\x97'
>>> b'\xe6\xb1\x97'.decode()
'汗'
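encode() and decode() only round-trip when both sides use the same encoding. The sketch below, using the same character as the session above, shows what may happen otherwise: an encoding that lacks the character raises an error (or substitutes a placeholder if asked), and decoding with the wrong codec silently produces garbage.

```python
text = '汗'
utf8_bytes = text.encode('utf-8')

# ASCII has no Chinese characters, so a strict encode fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# errors='replace' substitutes '?' instead of raising.
replaced = text.encode('ascii', errors='replace')

# Decoding UTF-8 bytes as Latin-1 does not fail, but yields three
# unrelated characters (one per byte) instead of the original one.
mojibake = utf8_bytes.decode('latin-1')

# Decoding with the matching codec restores the original text.
round_trip = utf8_bytes.decode('utf-8')

print(ascii_ok, replaced, len(mojibake), round_trip)
```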
Important: the difference between UTF-8 and Unicode
Unicode assigns each character a code point, and UTF-8 is one way of storing that code point as bytes. A common Chinese character takes three bytes in UTF-8, while its Unicode code point fits in two bytes (16 bits). Enough talk, see the result:
>>> tmp_str = '日'
>>> tmp_str
'日'
>>> tmp_str.encode('unicode_escape')
b'\\u65e5'
>>> tmp_str.encode('utf8')
b'\xe6\x97\xa5'
>>> b'\\u65e5'.decode('unicode_escape')
'日'
>>> b'\xe6\x97\xa5'.decode('utf8')
'日'
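To see how the two-byte code point 65e5 turns into the three UTF-8 bytes e6 97 a5, here is a minimal sketch that packs the bits by hand. The function name utf8_three_bytes is hypothetical, and the sketch only covers the 3-byte range U+0800 to U+FFFF (it does not reject surrogate code points, which a real encoder would).

```python
def utf8_three_bytes(code_point):
    """Pack a code point in U+0800..U+FFFF into the 3-byte UTF-8 layout."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError('this sketch only handles the 3-byte range')
    b1 = 0b11100000 | (code_point >> 12)           # 1110xxxx: top 4 bits
    b2 = 0b10000000 | ((code_point >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    b3 = 0b10000000 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([b1, b2, b3])

code_point = ord('日')                  # 0x65e5, a 16-bit value
manual = utf8_three_bytes(code_point)
builtin = '日'.encode('utf-8')          # Python's own encoder, for comparison

print(hex(code_point), manual, manual == builtin)
```

The 16 bits of the code point are split 4 + 6 + 6, and each group gets a marker prefix, which is where the third byte comes from.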