BOM

The UTF family of encodings uses a BOM (byte order mark): the special character U+FEFF (ZERO WIDTH NO-BREAK SPACE) is placed at the very beginning of the file. When an editor reads the file, it can tell from the way the BOM is encoded which encoding and byte order the document uses.

es = 'A'
codes = ['utf-32', 'utf-16']
print([es.encode(code) for code in codes])

The output is [b'\xff\xfe\x00\x00A\x00\x00\x00', b'\xff\xfeA\x00']. The bytes at the start of each sequence (\xff\xfe\x00\x00 for UTF-32, \xff\xfe for UTF-16) are the BOM. A UTF-16 BOM may also appear as \xfe\xff, which indicates big-endian byte order. The UTF-8 encoding of U+FEFF is 0xEF 0xBB 0xBF.

When we encode with UTF-16LE or UTF-16BE, the output does not carry a BOM. By naming the byte order explicitly, we have told the program that we already know how the byte sequence is laid out, so no marker is needed.
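The difference between the generic and the explicit-endian codec names can be checked directly. The following is a small sketch that uses only the standard library; the variable names are chosen for illustration.

```python
import codecs

text = 'A'

# The generic name prepends a BOM; the explicit-endian names do not.
with_bom = text.encode('utf-16')     # BOM + the platform's native byte order
le = text.encode('utf-16-le')        # little-endian, no BOM
be = text.encode('utf-16-be')        # big-endian, no BOM
print(with_bom, le, be)

# The BOM is simply U+FEFF encoded in the chosen form.
print(codecs.BOM_UTF16_LE == '\ufeff'.encode('utf-16-le'))
print(codecs.BOM_UTF16_BE == '\ufeff'.encode('utf-16-be'))
print(codecs.BOM_UTF8 == '\ufeff'.encode('utf-8'))
```

The generic 'utf-16' result depends on the machine's byte order, which is why its BOM is \xff\xfe on common little-endian hardware.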

BOM in Windows Notepad

The BOM is common in Windows Notepad. Create a text file, write the same content, and save it in each of the four formats Notepad supports: ANSI, Unicode, Unicode big endian, and UTF-8. Viewing the files in binary form shows the BOM and the encoded byte sequence that follows it. In Python, the 'b' flag in the mode passed to open() reads a file in binary mode.

# File contents are: abAB * *, but save time to choose a different encoding format. fs = ['ansi.txt'.'unicode.txt'.'unicode big endian.txt'.'utf-8.txt']

for f in fs:
    with open(f,'rb') as f_:
        print(f_.readline())

The output is:

b'abAB\xb9\xae\xa1\xef\xa1\xee'
b'\xff\xfea\x00b\x00A\x00B\x00\xe9\x5d\x05\x26\x06\x26'
b'\xfe\xff\x00a\x00b\x00A\x00B\x5d\xe9\x26\x05\x26\x06'
b'\xef\xbb\xbfabAB\xe5\xb7\xa9\xe2\x98\x85\xe2\x98\x86'

From the bytes of the character 巩 we can see that:

  • ANSI actually uses a GB-series encoding.
  • unicode.txt and unicode big endian.txt are both stored in UTF-16: unicode.txt is UTF-16LE, and unicode big endian.txt is UTF-16BE.
  • utf-8.txt is stored as UTF-8 with a BOM.
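The way an editor can read the encoding off the BOM can be sketched as a prefix check. The function name sniff_bom and the table _BOMS are made up for this illustration; the BOM constants come from the standard codecs module.

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32-LE BOM (ff fe 00 00)
# begins with the UTF-16-LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# The bytes Notepad wrote for the "Unicode" file shown above.
raw = b'\xff\xfea\x00b\x00A\x00B\x00\xe9\x5d\x05\x26\x06\x26'
encoding, skip = sniff_bom(raw)
print(encoding, raw[skip:].decode(encoding))
```

A file without a BOM, such as the ANSI one, returns (None, 0), so a real tool still needs a fallback. The check is also a heuristic: a UTF-16LE file whose first character is U+0000 would be mistaken for UTF-32LE.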

BOM of other editors

Notepad uses the BOM to let a text file mark its own encoding, but editors are not required to do so: a file may or may not carry a BOM. For example, the Vim editor on Linux does not write a BOM by default, which is a common source of trouble when editing across operating systems and editors.

In Vim, query the BOM setting with :set bomb?, remove the BOM with :set nobomb, and add it with :set bomb.
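The same add and remove operations can be done in Python for UTF-8 files. This is only a sketch: the two helper names are invented here, while codecs.BOM_UTF8 and the 'utf-8-sig' codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data):
    """Drop a leading UTF-8 BOM if present (similar to :set nobomb)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def add_utf8_bom(data):
    """Prepend a UTF-8 BOM if it is missing (similar to :set bomb)."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data

raw = 'abAB'.encode('utf-8-sig')      # the 'utf-8-sig' codec writes the BOM
print(raw, strip_utf8_bom(raw))

# Decoding with 'utf-8-sig' accepts input with or without a BOM.
print(raw.decode('utf-8-sig'), b'abAB'.decode('utf-8-sig'))
```

Reading files of unknown origin with encoding='utf-8-sig' is therefore a convenient way to tolerate both Notepad-style and Vim-style UTF-8 files.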

Why does the ANSI format above turn out to be a GB-series encoding?

ANSI and code page for Windows

Why the ANSI format is a GB encoding takes a little more explanation. ANSI is the American National Standards Institute. Historically, countries and regions (especially those that do not use the Latin alphabet) defined their own character encodings, and Windows groups all of them under the name ANSI. So the encoding that ANSI actually refers to varies by region: for example GB2312 for mainland China, JIS for Japan, and Big5 for Taiwan. ANSI is really a "pointer" to a specific encoding.

Windows Notepad uses ANSI to refer to the local encoding, while Windows itself uses code pages to refer to the system encoding; for our purposes the two concepts can be treated as the same thing. The active code page is shown by typing chcp in a command prompt window. For example, a Simplified Chinese system returns "Active code page: 936", that is, CP936, also known as the GBK encoding. A code page is likewise a "pointer" to a specific encoding.

Selecting the ANSI format in Notepad actually selects the encoding that the system's active code page refers to. In this example, selecting ANSI in Notepad means selecting code page 936, and CP936 points to GBK.
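Python mirrors this "pointer" idea through codec aliases. The sketch below assumes a standard CPython installation, where 'cp936' resolves to the GBK codec; the variable name ansi_bytes is chosen for illustration.

```python
import codecs
import locale

# 'cp936' is an alias: it resolves to Python's GBK codec.
print(codecs.lookup('cp936').name)

# The bytes Notepad wrote for the file saved as ANSI.
ansi_bytes = b'abAB\xb9\xae\xa1\xef\xa1\xee'
print(ansi_bytes.decode('cp936'))

# The encoding open() falls back to when encoding= is omitted.
# The value depends on the operating system and its settings.
print(locale.getpreferredencoding(False))
```

On a Simplified Chinese Windows system the last line typically reports cp936, which is why omitting encoding= appears to work there and then fails elsewhere.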

When text files move between Unix and Windows systems, two symptoms are common: stray ^M characters appear, or all lines are squeezed into a single line.

Newline problem

If you open a file created on Windows in hexadecimal mode with the UE editor or Vim, you will see that each line ends with the two bytes 0x0d 0x0a. These are the ASCII codes for carriage return (0x0d, \r) and line feed (0x0a, \n) respectively.

In Vim, the command :%!xxd converts the buffer to a hexadecimal display, and :%!xxd -r converts it back to text.
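A similar hexadecimal view can be produced in Python. The helper hex_view below is a hypothetical name and only loosely imitates the xxd layout; it relies on bytes.hex() accepting a separator, which requires Python 3.8 or later.

```python
def hex_view(data, width=16):
    """Return xxd-like lines: an offset followed by space-separated hex bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        lines.append(f'{offset:08x}: {chunk.hex(" ")}')
    return '\n'.join(lines)

# Two Windows-style lines: each one ends with 0d 0a.
print(hex_view(b'ab\r\ncd\r\n'))
```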

It all started with the mechanical typewriter. When a typewriter finished a full line, the carriage was pushed back to the left by hand (carriage return) and the paper was advanced by one line (line feed).

  • Windows borrows both characters, using carriage return plus line feed (\r\n) to mark the end of one line and the start of the next.
  • Unix systems use only the line feed (\n).
    • When a Unix system opens a Windows file, a strange ^M symbol often appears at the end of each line. It is how Unix tools display the extra carriage return \r.
    • When a Unix file is opened on a Windows system, all the lines may be squeezed together into a single line because the carriage return is missing.
  • Classic Mac OS used the carriage return \r alone to mark a new line; modern macOS uses \n like Unix.
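The three conventions in the list above can be converted into one another with plain byte replacement. This is a minimal sketch with invented function names; it assumes an ASCII-compatible encoding such as UTF-8 or GBK and would corrupt UTF-16 data.

```python
def to_unix(data):
    """Convert CRLF and lone CR line endings to LF."""
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

def to_windows(data):
    """Convert any of the three line endings to CRLF."""
    return to_unix(data).replace(b'\n', b'\r\n')

windows_text = b'line1\r\nline2\r\n'
print(to_unix(windows_text))
print(to_windows(b'line1\nline2\n'))

# str.splitlines() recognizes all three conventions.
print('a\r\nb\nc\rd'.splitlines())
```

When a file is opened in text mode with the default newline setting, Python already translates \r\n and \r to \n on reading, so explicit conversion is mostly needed when working with raw bytes.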

Coding problems in Python programming

So far we have traced the development of character encodings and seen how the BOM is used by different operating systems and editors. How should we deal with encodings in Python programming? There are several points to note:

UTF-8 is recommended

To facilitate international communication, Python 3 uses UTF-8 as the default source encoding. UTF-8 is also recommended for everyday web programming, as the default database encoding, and for text storage.

Decode and encode usage

When do we encode, and when do we decode? One rule to keep in mind is that Python 3 strings are sequences of Unicode code points, which means that every Python string literal is really a sequence of code point integers. All conversions go through Unicode code points as the intermediate form:

  • Encode is used when converting Unicode code points to byte sequences
  • Decode is used when byte sequences are converted to Unicode
str = 'gong of abAB'# gbk_seq = str.encode('gbk'Str2 = GBk_seq.decode ()'gbk'Utf_seq = str2.encode()'utf-8'Unicode code points are converted to UTF- 8 -Sequence of bytesprint(str,str2,gbk_seq,utf_seq)

In Python 2, by contrast, plain string literals are byte sequences in some particular encoding.

Sandwich principle

The sandwich principle says that whenever a file is read or written, its encoding must be stated explicitly, so that the Python program itself only ever deals with Unicode code points and never with raw byte sequences. Only at input and output is the text converted to or from a byte stream in the corresponding encoding. This arrangement, with specific encodings at both ends and only Unicode in the middle, resembles two slices of bread around a filling, hence the name.

with open('ansi.txt', 'r', encoding='gbk') as f, open('u8.txt', 'w', encoding='utf-8') as f2:
    s = f.readline()
    s = s[::-1]
    f2.write(s)

In the code above, all processing of the string s works on Unicode code points. The encodings GBK and UTF-8 are named only at the input and output ends.
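The same pattern generalizes to converting a whole file from one encoding to another. The helper transcode below is a hypothetical name, and preserving the original line endings with newline='' is a design choice made for this sketch.

```python
def transcode(src, dst, src_encoding, dst_encoding):
    """Copy a text file, re-encoding it from src_encoding to dst_encoding."""
    # newline='' on both sides keeps the original line endings unchanged.
    with open(src, 'r', encoding=src_encoding, newline='') as fin, \
         open(dst, 'w', encoding=dst_encoding, newline='') as fout:
        for line in fin:              # each line is a str of code points
            fout.write(line)

# Example call (the file names are placeholders):
# transcode('ansi.txt', 'u8.txt', 'gbk', 'utf-8')
```

Inside the loop there are only str objects; bytes exist only inside the two open() calls, which is the sandwich in its smallest form.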

Do not use PYTHon2

Do not use PYTHon2! The Str type in Python2, which is both a string and a sequence of bytes, is particularly complex. Do not use it unless special requirements are met. The following is enough to get a sense of Python2’s complexity:

Python2, s = ‘abAB gong ‘, is a sequence of bytes whose exact value varies depending on the encoding used (GBK, UTF-8). U = u’abAB gong ‘represents Unicode code points. In addition, the # -* -coding: UTF-8 declaration in Python2 specifies the encoding type of constants in the program, which must be consistent with the encoding type of the text saved.

Use chardet to guess the encoding of a byte sequence

If the encoding of a byte stream cannot be determined, the chardet module can be used to guess it.

import chardet
chardet.detect(b'\xef\xbb\xbfabAB\xe5\xb7\xa9\xe2\x98\x85\xe2\x98\x86')

The return value is a dictionary of the form {'encoding': 'UTF-8', …}, giving the most likely encoding and the confidence of the guess.
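When chardet is not installed, a much cruder fallback is to try a short list of candidate encodings in order. The sketch below is not how chardet works (chardet uses statistical detection); the function name and the candidate list are assumptions made for illustration.

```python
def guess_encoding(data, candidates=('utf-8-sig', 'gbk')):
    """Return the first candidate that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

print(guess_encoding(b'\xef\xbb\xbfabAB\xe5\xb7\xa9\xe2\x98\x85\xe2\x98\x86'))
print(guess_encoding(b'abAB\xb9\xae\xa1\xef\xa1\xee'))
```

The order of the candidates matters: strict encodings such as UTF-8 should come first, because permissive ones like GBK accept many byte sequences that were never written as GBK.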

Conclusion

  • The BOM was introduced so that a file can label its own encoding. We examined how Notepad uses the BOM and learned that the BOM is not mandatory.
  • We looked at the BOM, newline, and other issues that arise between Linux and Windows.
  • We learned the sandwich principle of Python programming: when reading or writing a file, always specify the file encoding explicitly.

Summary of this series

  • ASCII came first and was simple, but it could not represent the languages of other countries.
  • Extensions to ASCII added support for other languages, but produced garbled characters.
  • MBCS multi-byte encodings can support Chinese, a language with many characters, but the many national encodings are incompatible with each other.
  • Unicode unifies the characters of the world's languages. Among its several encoding forms:
    • UTF-32 is simple but wasteful.
    • UTF-16 uses two bytes for most characters (four for the rest), saving space.
    • UTF-8 is variable-length, using one to four bytes per character, and balances efficiency and space.
  • The BOM was introduced so that a file can label its own encoding. We examined how Notepad uses the BOM and learned that the BOM is not mandatory.
  • We looked at the BOM, newline, and other issues that arise between Linux and Windows.
  • We learned the sandwich principle of Python programming: when reading or writing a file, always specify the file encoding explicitly.

Differences in language and script have always made human communication difficult, and countless IT practitioners have been tormented by all kinds of garbled text. If the encoding problem is not dealt with seriously, it will keep resurfacing from time to time. We need to recognize that character encoding is the product of gradual evolution and optimization, and that it carries with it a collection of optimizations, exceptions, and compatibility measures.

Code for this article

https://gitee.com/gongqingkui/codes/0qku9tc6az3j2dvoy5s8w39

Gong Qingkui, Da Kui, interested in computer and electronic information engineering. gongqingkui at 126.com
