The basic concept

First of all, understand two concepts: some representations are meant for machines, and some are meant for people.

The number bases (2, 8, 10, 16) are different ways of writing the values a machine works with directly.

Encodings (ASCII, Unicode, UTF-8) map the binary representations that only a machine understands to characters that a person can read at a glance.

Basic knowledge

  1. Binary: 0-1
  2. Octal: 0-7
  3. Decimal: 0-9
  4. Hexadecimal: 0-9 and A-F
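
To make the digit sets above concrete, here is a minimal sketch that writes one quantity in each base using Python's literal prefixes (0b, 0o, 0x). The variable names are only illustrative.

```python
# The same integer written with each digit set listed above.
binary = 0b1010       # base 2, digits 0-1
octal = 0o12          # base 8, digits 0-7
decimal = 10          # base 10, digits 0-9
hexadecimal = 0xA     # base 16, digits 0-9 and A-F

# All four literals denote the same value; only the notation differs.
same_value = binary == octal == decimal == hexadecimal
print(same_value)
```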

Bits, bytes, characters, strings

  1. Bit: a single 0 or 1, the smallest unit of representation
  2. Byte: eight bits, such as 01010101; one byte, or a group of bytes, can represent a character
  3. Character: occupies at least 1 byte (8 bits). Note: the same character takes a different number of bytes in different encodings
  4. String: a sequence of characters
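
The chain from bit to string can be sketched as follows, reusing the bit pattern 01010101 from the list. This is only an illustration: it assumes ASCII for the byte-to-character step, and the variable names are made up for the example.

```python
# One bit pattern from the list above, written as a binary literal.
byte_value = 0b01010101

# Eight bits form one byte; format() pads the value to 8 binary digits.
bits = format(byte_value, '08b')

# Under ASCII, this byte stands for a single character.
character = chr(byte_value)

# A string is a sequence of characters; each ASCII character is one byte.
text = 'Hi'
bits_per_char = [format(b, '08b') for b in text.encode('ascii')]

print(bits, character, bits_per_char)
```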

Encoding

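  1. ASCII: 1 character = 1 byte = 8 bits (only 7 of the bits are used, so it covers 128 characters).
  2. Unicode: a character set that assigns every character a number (its code point). In the fixed-width 16-bit form, 1 character = 2 bytes = 16 bits; characters outside that range need more.
  3. UTF-8: a variable-length encoding of Unicode. 1 character = 1 to 4 bytes.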
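
The byte counts above can be checked directly. The sketch below is an illustration, not a complete survey: it assumes three sample characters and uses Python's 'utf-16-le' and 'utf-32-le' codecs (the little-endian forms, chosen so that no byte-order mark is added to the length).

```python
# Sample characters: ASCII letter, a Chinese character, and an emoji
# (written as an escape so the source stays plain).
samples = ['a', '汗', '\U0001F600']
encodings = ['utf-8', 'utf-16-le', 'utf-32-le']

# Number of bytes each character occupies under each encoding.
byte_lengths = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}
for ch, sizes in byte_lengths.items():
    print(repr(ch), sizes)

# ASCII only covers code points 0-127, so it cannot encode '汗' at all.
try:
    '汗'.encode('ascii')
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False
```

Note that the emoji takes 4 bytes even in UTF-16, which is why "2 bytes per character" only holds for the 16-bit range.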

Relationships and conversions

1. Numbers

In Python, integers are converted between bases with int(), bin(), oct(), and hex().

>>> 10
10
>>> bin(10)
'0b1010'
>>> oct(10)
'0o12'
>>> hex(10)
'0xa'
>>> int('0b1010',base=2)
10
>>> int('0o12',base=8)
10
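
As a complement to the session above, the following sketch shows two variations that may be handy: format() produces the digits without the '0b'/'0o'/'0x' prefix, and int() accepts any base from 2 to 36, not only 2, 8, and 16. The variable names are illustrative.

```python
n = 10

# format() gives the bare digits, without the '0b' / '0o' / '0x' prefix.
digits = {2: format(n, 'b'), 8: format(n, 'o'), 16: format(n, 'x')}

# int(text, base) parses the digits back into the same integer.
round_trip = {base: int(text, base) for base, text in digits.items()}

# int() is not limited to 2, 8 and 16: '101' in base 3 is 9 + 0 + 1.
from_base_3 = int('101', 3)

print(digits, round_trip, from_base_3)
```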

2. Characters

In Python, a character and its code point are converted with chr() and ord().

>>> ord('汗')
27735
>>> chr(27735)
'汗'
>>> bin(27735)
'0b110110001010111'

>>> ord('a')
97
>>> chr(97)
'a'
>>> bin(97)
'0b1100001'
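
Since ord() and chr() work on one character at a time, a whole string has to be converted character by character. A minimal sketch, with illustrative variable names; the U+XXXX formatting is just the conventional way of writing code points:

```python
text = 'a汗'

# ord() maps each character to its Unicode code point.
code_points = [ord(ch) for ch in text]

# chr() maps each code point back to its character.
rebuilt = ''.join(chr(cp) for cp in code_points)

# Code points are conventionally written as U+ followed by hex digits.
hex_points = [f'U+{cp:04X}' for cp in code_points]

# ord() accepts exactly one character; a longer string raises TypeError.
try:
    ord('ab')
    accepts_two_chars = True
except TypeError:
    accepts_two_chars = False

print(code_points, rebuilt, hex_points)
```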

3. Conversion between encodings

Use str.encode() and bytes.decode() (both default to UTF-8).

>>> tmp_str = '汗'
>>> tmp_str.encode()
b'\xe6\xb1\x97'
>>> b'\xe6\xb1\x97'.decode()
'汗'
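
Bytes only decode correctly with the encoding that produced them. The sketch below illustrates this under a few assumptions: 'gbk' is picked only as an example of a second encoding that covers Chinese characters, and errors='replace' is one of several error handlers Python offers.

```python
text = '汗'

# The same character becomes different bytes under different encodings.
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# Each byte sequence round-trips only through its own encoding.
from_gbk = gbk_bytes.decode('gbk')

# Decoding with the wrong codec fails...
try:
    utf8_bytes.decode('ascii')
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False

# ...unless an error handler is given, in which case each undecodable
# byte is replaced by U+FFFD and the original character is lost.
lossy = utf8_bytes.decode('ascii', errors='replace')

print(utf8_bytes, len(gbk_bytes), from_gbk, decoded_as_ascii, lossy)
```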

Important: the difference between UTF-8 and Unicode

Unicode assigns each character a number (its code point); UTF-8 is one way of storing that number as bytes. A typical Chinese character has a code point that fits in two bytes, yet it takes three bytes in UTF-8. Seeing is believing:

>>> tmp_str = '日'
>>> tmp_str
'日'
>>> tmp_str.encode('unicode_escape')
b'\\u65e5'
>>> tmp_str.encode('utf8')
b'\xe6\x97\xa5'
>>> b'\\u65e5'.decode('unicode_escape')
'日'
>>> b'\xe6\x97\xa5'.decode('utf8')
'日'
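
Why two bytes of code point turn into three bytes of UTF-8 can be shown by doing the encoding by hand. The sketch below is a teaching aid, not a replacement for str.encode(): the helper encode_three_byte_utf8 is a hypothetical name, and it handles only the three-byte range (U+0800 to U+FFFF, excluding surrogates), which is where common Chinese characters fall.

```python
def encode_three_byte_utf8(code_point):
    """Hand-encode a code point from the three-byte UTF-8 range.

    Hypothetical helper for illustration; real code should call
    str.encode('utf-8') instead.
    """
    in_range = 0x0800 <= code_point <= 0xFFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    if not in_range or is_surrogate:
        raise ValueError('code point is outside the three-byte UTF-8 range')
    # The 16 bits are split 4 + 6 + 6 and placed behind fixed marker bits,
    # which is why 2 bytes of code point need 3 bytes of UTF-8.
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])


code_point = ord('日')                        # 0x65E5
manual = encode_three_byte_utf8(code_point)
builtin = '日'.encode('utf-8')
print(hex(code_point), manual, manual == builtin)
```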