unicodedata - Unicode database

Preface: In the test code, a line starting with a right angle bracket (>) is a command entered on the command line, and a line starting with a hash sign (#) is the output. The library import is shown only in the first code block of this article; the other code blocks omit it.

System type: Windows 10
Python version: Python 3.9.0

This module provides access to the Unicode Character Database (UCD), which defines the character properties of all Unicode characters.

unicodedata.lookup(name)

Find Unicode characters by their names.

Parameter:
    name: Unicode character name
Returned value:
    str, the Unicode character

Looks up a Unicode character by the name passed in. If the name is found, the character is returned; otherwise KeyError is raised.

import unicodedata

print(unicodedata.lookup('CJK COMPATIBILITY IDEOGRAPH-2F80F'))
# 兔
print(unicodedata.lookup('ARMENIAN SMALL LIGATURE MEN NOW'))
# ﬓ
print(unicodedata.lookup('111'))
# Traceback (most recent call last):
#   File "e:\project\test\test.py", line 3, in <module>
#     print(unicodedata.lookup('111'))
# KeyError: "undefined character name '111'"
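Because lookup() raises KeyError for an unknown name, a small wrapper can make the failure case explicit. The following is only a sketch: the helper name safe_lookup and its fallback parameter are hypothetical and not part of the unicodedata module.

```python
import unicodedata


def safe_lookup(name, fallback=None):
    """Return the character called `name`, or `fallback` if the name is unknown.

    safe_lookup is a hypothetical helper, not part of unicodedata.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return fallback


print(safe_lookup('LATIN SMALL LETTER A'))
# a
print(safe_lookup('111', fallback='?'))
# ?
print(unicodedata.name(safe_lookup('WRITING HAND')))
# WRITING HAND
```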

unicodedata.name(chr[, default])

Gets the name of a Unicode character

Parameters:
    chr: str, a character
    default: optional, value returned when no Unicode character name is found
Returned value:
    str, the Unicode character name, or the default value passed in

Returns the name if one is found. If no name is found, the default value is returned when one was passed in; otherwise ValueError is raised.

print(unicodedata.name('a'))
# LATIN SMALL LETTER A
print(unicodedata.name('❋'))
# HEAVY EIGHT TEARDROP-SPOKED PROPELLER ASTERISK
print(unicodedata.name('✍'))
# WRITING HAND

PS: It's not that I skipped testing the default value and the error; I just could not find a character that has no Unicode name. 😂
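Control characters are one place where no name is defined, so they can be used to exercise both the default value and the error. A minimal sketch, using the newline character as the test input:

```python
import unicodedata

# Control characters such as the newline have no Unicode name.
print(unicodedata.name('\n', 'no name'))
# no name

try:
    unicodedata.name('\n')
except ValueError as exc:
    print(type(exc).__name__)
# ValueError
```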

The three useless brothers?

unicodedata.decimal(chr[, default])
unicodedata.digit(chr[, default])
unicodedata.numeric(chr[, default])

Gets a value that represents a numeric character

Parameters:
    chr: str, a numeric character
    default: optional, value returned when the character has no such numeric value
Returned value:
    int for decimal() and digit(), float for numeric()

Pass in a numeric character and the value it represents is returned. If the character has no such value, the default is returned when one was passed in; otherwise ValueError is raised.

print(unicodedata.decimal('3'))
# 3
print(unicodedata.decimal('b', 'Error'))
# Error
print(unicodedata.decimal('b'))
# Traceback (most recent call last):
#   File "e:\project\test\test.py", line 3, in <module>
#     print(unicodedata.decimal('b'))
# ValueError: not a decimal

In my tests with ordinary keyboard input, decimal() accepted only the ten str digits '0' to '9' and raised an error for everything else. Pass in a str digit, get back an int: at first I really did not understand what this function is for, and unicodedata.digit(chr[, default]) looked just the same. The point is that both functions read Unicode properties rather than parse ASCII text. decimal() accepts any character that has a decimal digit value, including digits from other scripts such as the Arabic-Indic digit '٣', and digit() additionally accepts digit-like characters such as the superscript '²' and the circled '①'.
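The difference between decimal() and digit() shows up with non-ASCII digits. A minimal sketch, passing None as the default so that no exception is raised; the four sample characters are my own choice:

```python
import unicodedata

# ASCII 3, ARABIC-INDIC DIGIT THREE, SUPERSCRIPT TWO, CIRCLED DIGIT ONE
samples = ['3', '\u0663', '\u00b2', '\u2460']

for ch in samples:
    print(unicodedata.name(ch),
          unicodedata.decimal(ch, None),
          unicodedata.digit(ch, None))
# DIGIT THREE 3 3
# ARABIC-INDIC DIGIT THREE 3 3
# SUPERSCRIPT TWO None 2
# CIRCLED DIGIT ONE None 1
```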

The unicodedata.numeric() function is more powerful than the first two. It accepts a single str character, and not only the ten digits '0' to '9': it also handles Chinese numerals, Chinese uppercase (financial) numerals, Roman numerals, and even enclosed numbers such as ①, returning the corresponding value as a float.

print(unicodedata.numeric('①'))
# 1.0
print(unicodedata.numeric('一'))
# 1.0
print(unicodedata.numeric('壹'))
# 1.0
print(unicodedata.numeric('Ⅰ'))
# 1.0
print(unicodedata.numeric('1'))
# 1.0
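The float return type matters for characters whose value is not a whole number. A short sketch with a fraction, a larger Roman numeral, and a non-numeric character (the sample characters are my own choice):

```python
import unicodedata

print(unicodedata.numeric('\u00bd'))   # VULGAR FRACTION ONE HALF
# 0.5
print(unicodedata.numeric('\u216b'))   # ROMAN NUMERAL TWELVE
# 12.0
print(unicodedata.numeric('a', None))  # not numeric, so the default comes back
# None
```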

Classification team

unicodedata.category(chr)
unicodedata.bidirectional(chr)
unicodedata.east_asian_width(chr)

Unicode defines many classification schemes, and every character has a class under each of them. The three functions above return a character's class name under three of these schemes.

Parameter:
    chr: str, a character
Returned value:
    str, the class name

unicodedata.category(chr) returns the name of the character's general category, unicodedata.bidirectional(chr) returns the name of its bidirectional class, and unicodedata.east_asian_width(chr) returns the name of its East Asian width class.

In Unicode, most scripts are written from left to right, but some, such as Arabic and Hebrew, are written from right to left, so Unicode defines a bidirectional algorithm. As a result, every character is also assigned a bidirectional class.

Now for the character width class. Characters in Unicode differ in display width, which matters especially for East Asian text. Unicode therefore defines a set of width classes and assigns every character to one of them. The east_asian_width() function returns the width class of the specified character. As the function name suggests, the property exists because East Asian characters may be wide or narrow, and text in this region needs that information to be laid out properly.
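One common use of the width class is estimating how many terminal columns a string occupies. The sketch below is an approximation under stated assumptions, not a complete implementation: display_width is a hypothetical helper, wide (W) and fullwidth (F) characters are assumed to take two columns, and everything else is assumed to take one.

```python
import unicodedata


def display_width(text):
    """Estimate terminal columns for `text` (hypothetical helper).

    Assumption: wide (W) and fullwidth (F) characters take two columns and
    all other characters take one. Ambiguous (A) characters and combining
    marks are not treated specially here.
    """
    return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
               for ch in text)


print(display_width('abc'))
# 3
print(display_width('あいう'))
# 6
print(display_width('Python编程'))
# 10
```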

Heads up, a wave of test code is coming:

# Get the general category of a character
print(unicodedata.category('①'))  # symbol
# No
print(unicodedata.category('一'))  # Chinese
# Lo
print(unicodedata.category('✔'))  # symbol
# So
# Get the bidirectional class of a character
print(unicodedata.bidirectional('ب'))  # a letter in Arabic
# AL
print(unicodedata.bidirectional('①'))
# ON
print(unicodedata.bidirectional('一'))
# L
print(unicodedata.bidirectional('✔'))
# ON
print(unicodedata.bidirectional('1'))
# EN
# Get the East Asian width class of a character
print(unicodedata.east_asian_width('①'))  # symbol
# A
print(unicodedata.east_asian_width('一'))  # Chinese
# W
print(unicodedata.east_asian_width('あ'))  # Japanese
# W
print(unicodedata.east_asian_width('ب'))  # a letter in Arabic
# N
print(unicodedata.east_asian_width('1'))  # digit
# Na
print(unicodedata.east_asian_width('A'))  # English letter
# Na

If you look at the test code, you’ll notice that the class names you get are all abbreviations, which makes it harder to understand what each class name means. You can find instructions in the Unicode documentation, which is listed in Resources below.
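To make the abbreviations easier to read, they can be mapped to their long names. The table below is deliberately partial and covers only a handful of general categories; CATEGORY_NAMES and describe_category are hypothetical names used for illustration.

```python
import unicodedata

# A partial table: only a few of the general category abbreviations.
CATEGORY_NAMES = {
    'Lu': 'Letter, uppercase',
    'Ll': 'Letter, lowercase',
    'Lo': 'Letter, other',
    'Nd': 'Number, decimal digit',
    'No': 'Number, other',
    'So': 'Symbol, other',
}


def describe_category(ch):
    """Return 'abbr (long name)' for a character (hypothetical helper)."""
    abbr = unicodedata.category(ch)
    return f"{abbr} ({CATEGORY_NAMES.get(abbr, 'not in this partial table')})"


for ch in ['A', 'a', '1', '①', '✔']:
    print(ch, describe_category(ch))
# A Lu (Letter, uppercase)
# a Ll (Letter, lowercase)
# 1 Nd (Number, decimal digit)
# ① No (Number, other)
# ✔ So (Symbol, other)
```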

unicodedata.combining(chr)

Gets the canonical combining class of the specified character

Parameter:
    chr: str, a character
Returned value:
    int, the canonical combining class of the character

Returns the canonical combining class of the specified character as an integer. During normalization, combining marks are sorted in ascending order of this value. Returns 0 if no combining class is defined for the character.

print(unicodedata.combining('一'))
# 0
print(unicodedata.combining('1'))
# 0
print(unicodedata.combining('A'))
# 0
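Ordinary letters and digits all return 0; a non-zero value appears only for combining marks. The sketch below uses two combining marks of my own choosing and shows the ascending sort applied during normalization:

```python
import unicodedata

acute = '\u0301'      # COMBINING ACUTE ACCENT, drawn above the base letter
dot_below = '\u0323'  # COMBINING DOT BELOW, drawn below the base letter

print(unicodedata.combining(acute))
# 230
print(unicodedata.combining(dot_below))
# 220

# Normalization sorts adjacent marks by ascending combining class,
# so the dot below (220) ends up before the acute accent (230).
reordered = unicodedata.normalize('NFD', 'a' + acute + dot_below)
print([unicodedata.name(ch) for ch in reordered])
# ['LATIN SMALL LETTER A', 'COMBINING DOT BELOW', 'COMBINING ACUTE ACCENT']
```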

unicodedata.mirrored(chr)

Checks whether the specified character is a mirrored character

Parameter:
    chr: str, a character
Returned value:
    int, 0 or 1

A mirrored character is one whose glyph is flipped horizontally when it appears in right-to-left text, such as ( and ). Returns 1 if the character is mirrored, 0 otherwise.

print(unicodedata.mirrored('口'))
# 0
print(unicodedata.mirrored('1'))
# 0
print(unicodedata.mirrored('A'))
# 0
print(unicodedata.mirrored('/'))
# 0
print(unicodedata.mirrored('('))
# 1
print(unicodedata.mirrored('['))
# 1
print(unicodedata.mirrored('{'))
# 1
print(unicodedata.mirrored('<'))
# 1
print(unicodedata.mirrored('"'))
# 0
print(unicodedata.mirrored("'"))
# 0
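As a quick way to see which characters in a piece of text carry the mirrored property, the values can be collected with a list comprehension. A small sketch with a made-up sample string:

```python
import unicodedata

text = 'f(x) < [y] / "z"'
mirrored_chars = [ch for ch in text if unicodedata.mirrored(ch)]
print(mirrored_chars)
# ['(', ')', '<', '[', ']']
```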

unicodedata.decomposition(chr)

Gets the decomposition map of the specified character

Parameter:
    chr: str, a character
Returned value:
    str, a string of hexadecimal code points

Gets the decomposition mapping of the specified character, or returns an empty string if the character has none.

print(unicodedata.decomposition('ﬁ'))
# <compat> 0066 0069
print(unicodedata.decomposition('আ'))
#
print(unicodedata.decomposition('▇'))
#
print(unicodedata.decomposition('ㄌ'))
#
print(unicodedata.decomposition('⠟'))
#
print(unicodedata.decomposition('𐊨'))
#
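The returned string is a space-separated list of hexadecimal code points, optionally preceded by a tag such as <compat>. One possible way to turn it back into characters is sketched below; the helper name decompose is hypothetical.

```python
import unicodedata


def decompose(ch):
    """Split decomposition() output into (tag, characters); hypothetical helper."""
    parts = unicodedata.decomposition(ch).split()
    tag = None
    if parts and parts[0].startswith('<'):
        tag = parts.pop(0)  # compatibility tag such as '<compat>'
    return tag, ''.join(chr(int(code, 16)) for code in parts)


print(decompose('\ufb01'))  # LATIN SMALL LIGATURE FI
# ('<compat>', 'fi')

tag, chars = decompose('\u00e9')  # LATIN SMALL LETTER E WITH ACUTE
print(tag, [unicodedata.name(c) for c in chars])
# None ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

print(decompose('a'))  # no decomposition mapping
# (None, '')
```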

unicodedata.is_normalized(form, unistr)

Checks whether the specified string is in a given normalization form

Parameters:
    form: str, normalization form; valid values: NFC, NFKC, NFD, NFKD
    unistr: str, the string to check
Returned value:
    bool

Checks whether the specified string is in the given normalization form, returning True if it is, False otherwise. This function is available since Python 3.8.

print(unicodedata.is_normalized('NFC', 'ā m: a. a not ǒ ě ē e o o e'))
# True
print(unicodedata.is_normalized('NFD', 'ā m: a. a not ǒ ě ē e o o e'))
# False
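The result depends on how the accented letters are stored. The sketch below checks the same visible letter, é, written once as a single code point and once as a base letter plus a combining mark, against all four forms:

```python
import unicodedata

composed = '\u00e9'     # é as a single code point
decomposed = 'e\u0301'  # e followed by COMBINING ACUTE ACCENT

for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(form,
          unicodedata.is_normalized(form, composed),
          unicodedata.is_normalized(form, decomposed))
# NFC True False
# NFD False True
# NFKC True False
# NFKD False True
```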

unicodedata.normalize(form, unistr)

Converts the specified string to the specified normalization form

Parameters:
    form: str, normalization form; valid values: NFC, NFKC, NFD, NFKD
    unistr: str, the string to convert
Returned value:
    str

Converts the specified string to the specified normalization form and returns the result. In the following example, the string is first shown not to be in NFD form; it is then converted to NFD, and the converted string is checked again. This verifies that normalize() works.

temp_str = 'ā m: a. a not ǒ ě ē e o o e'
print(unicodedata.is_normalized('NFD', temp_str))  # the string is not in NFD form
# False
print(unicodedata.is_normalized('NFD', unicodedata.normalize('NFD', temp_str)))  # convert the string to NFD, then check whether it is in NFD form
# True
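A typical reason to normalize is comparing strings that look the same but are stored differently. The sketch below also shows the compatibility form NFKC, which replaces characters such as ligatures and circled digits with their plain equivalents; the sample strings are my own.

```python
import unicodedata

composed = '\u00e9'     # é as a single code point
decomposed = 'e\u0301'  # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)
# False
print(unicodedata.normalize('NFC', composed) == unicodedata.normalize('NFC', decomposed))
# True
print(len(unicodedata.normalize('NFD', composed)))
# 2

# NFKC also applies compatibility mappings: ligature fi + circled one -> "fi1"
print(unicodedata.normalize('NFKC', '\ufb01\u2460'))
# fi1
```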

Two attributes

unicodedata.unidata_version gives the version of the Unicode database used by the unicodedata module.

unicodedata.ucd_3_2_0 is an object that has the same methods as the unicodedata module but uses version 3.2.0 of the Unicode database, which is needed in some special cases.

print(unicodedata.unidata_version)
# 13.0.0
print(unicodedata.ucd_3_2_0.unidata_version)
# 3.2.0
print(unicodedata.ucd_3_2_0.name('a'))
# LATIN SMALL LETTER A
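The difference between the two databases becomes visible with a character that was added after Unicode 3.2.0. A minimal sketch, using an emoji as the sample character and assuming the module's own database is recent enough to include it (true for the 13.0.0 database shown above):

```python
import unicodedata

emoji = '\U0001F602'  # FACE WITH TEARS OF JOY, added to Unicode after 3.2.0
old = unicodedata.ucd_3_2_0

print(unicodedata.category(emoji), unicodedata.name(emoji, None))
# So FACE WITH TEARS OF JOY
print(old.category(emoji), old.name(emoji, None))
# Cn None
```

Cn is the general category for unassigned code points, which is how the 3.2.0 database sees this character.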

Public account: Python Grocery Store, specializing in the Python language and related knowledge. Follow it to discover more original articles.

Resources

Official documentation: docs.python.org/zh-cn/3/lib…

Unicode bidirectional character types: www.unicode.org/reports/tr9…

Unicode East Asian width types: www.unicode.org/reports/tr1…
