
Text preprocessing

Tokenizer

keras.preprocessing.text.Tokenizer(num_words=None, 
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                   lower=True, 
                                   split=' ', 
                                   char_level=False, 
                                   oov_token=None, 
                                   document_count=0)

This class vectorizes a text corpus in one of two ways: by turning each text into a sequence of integers (each integer being the index of a token in a dictionary), or by turning it into a vector whose coefficient for each token can be binary, a word count, a TF-IDF weight, and so on.

Parameters

  • num_words: The maximum number of words to keep, based on word frequency. Only the most common num_words - 1 words are kept, because index 0 is reserved.
  • filters: A string in which each element is a character to be filtered out of the text. The default is all punctuation, plus tab and newline, minus the ' character.
  • lower: Boolean. Whether to convert the text to lowercase.
  • split: String. The separator used to split the text into words.
  • char_level: If True, every character is treated as a token.
  • oov_token: If given, it is added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.

By default, all punctuation is removed, turning each text into a space-separated sequence of words (words may contain the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized. 0 is a reserved index that is not assigned to any word.
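The default pipeline described above (filter, lowercase, split, index by frequency) can be sketched in plain Python. This is only an illustration under stated assumptions, not the Keras implementation: the helper names split_words, build_word_index, and to_sequences are hypothetical, and oov_token and char_level are left out for brevity.

```python
from collections import Counter

# Default filter characters as listed in the Keras documentation (assumed here).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, turn every filtered character into the separator, then split."""
    if lower:
        text = text.lower()
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    return [w for w in text.split(split) if w]


def build_word_index(texts):
    """Map each word to an integer, most frequent word first; 0 is never used."""
    counts = Counter(w for t in texts for w in split_words(t))
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Replace each word by its index, dropping unknown or too-rare words."""
    sequences = []
    for t in texts:
        ids = [word_index[w] for w in split_words(t) if w in word_index]
        if num_words is not None:
            # Keep only indices below num_words, i.e. the num_words - 1 most common.
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


texts = ["The cat sat.", "The cat ate; the dog sat!"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)
top_only = to_sequences(texts, word_index, num_words=3)
```

With num_words=3, only the two most frequent words survive in top_only, which mirrors the num_words cutoff described in the parameter list.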

hashing_trick

hashing_trick converts a text into a sequence of indices in a fixed-size hash space.

keras.preprocessing.text.hashing_trick(text, n,
                                       hash_function=None, 
                                       filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                       lower=True, 
                                       split=' ')

Parameters

  • text: Input text (string).
  • n: Dimension of the hash space.
  • hash_function: Defaults to Python's built-in hash function. It can also be 'md5' or any function that takes a string as input and returns an integer. Note that hash is not a stable hash function, so its results are not consistent across runs, whereas 'md5' is stable.
  • filters: A list (or concatenation) of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', which covers basic punctuation, tabs, and newlines.
  • lower: Boolean. Whether to convert the text to lowercase.
  • split: String. The separator used to split the text into words.

Returns: a list of integer word indices (uniqueness is not guaranteed).

0 is a reserved index that is not assigned to any word. Because hash functions can collide, two or more words may be assigned to the same index. The probability of a collision depends on the dimension of the hash space and the number of distinct words.
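To make the hashing step concrete, here is a minimal sketch that maps words into a fixed-size space with a stable MD5 digest. The helper names md5_index and hash_words are hypothetical, and the formula hash % (n - 1) + 1 is an assumption chosen so that index 0 stays reserved; treat it as an illustration rather than the library's exact code.

```python
import hashlib


def md5_index(word, n):
    """Map a word to an index in [1, n - 1] using a stable MD5 digest."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    # Shift by one so that 0 is never produced (0 is the reserved index).
    return int(digest, 16) % (n - 1) + 1


def hash_words(words, n):
    """Hash every word independently; no vocabulary has to be stored."""
    return [md5_index(w, n) for w in words]


words = ["the", "cat", "sat", "on", "the", "mat"]
indices = hash_words(words, n=50)

# A very small hash space forces collisions: with n=2 every word maps to 1.
tiny = hash_words(words, n=2)

# Number of distinct words that ended up sharing an index with another word.
lost = len(set(words)) - len(set(indices))
```

The same word always receives the same index, while different words may share one; shrinking n makes such collisions more likely.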

one_hot

one_hot encodes a text into a list of word indices drawn from a vocabulary of size n. It is a wrapper around the hashing_trick function that uses hash as the hash function; the mapping from words to indices is not guaranteed to be unique.

keras.preprocessing.text.one_hot(text, n, 
                                 filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                 lower=True, 
                                 split=' ')

Parameters

  • text: Input text (string).
  • n: Integer. Size of the vocabulary.
  • filters: A list (or concatenation) of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', which covers basic punctuation, tabs, and newlines.
  • lower: Boolean. Whether to convert the text to lowercase.
  • split: String. The separator used to split the text into words.

Returns: a list of integers in [1, n]. Each integer encodes a word (uniqueness is not guaranteed).
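As a rough illustration of the idea, the sketch below hashes each word with Python's built-in hash. The function name one_hot_sketch is hypothetical, punctuation filtering is omitted for brevity, and the modulo formula is an assumption rather than a quote from the library.

```python
def one_hot_sketch(text, n, split=" "):
    """Encode each lowercased word as an integer using the built-in hash()."""
    words = [w for w in text.lower().split(split) if w]
    # Python's % returns a non-negative result for a positive modulus,
    # so every index lands in [1, n - 1] even when hash() is negative.
    return [hash(w) % (n - 1) + 1 for w in words]


encoded = one_hot_sketch("the cat sat on the mat", n=20)
```

Because Python randomizes string hashes per process unless PYTHONHASHSEED is fixed, the concrete values in encoded can differ from one run to the next; only repeated words within the same run are guaranteed to share an index.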



text_to_word_sequence

text_to_word_sequence converts a text into a sequence of words (or tokens).

keras.preprocessing.text.text_to_word_sequence(text, 
                                               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                               lower=True, 
                                               split=' ')

Parameters

  • text: Input text (string).
  • filters: A list (or concatenation) of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', which covers basic punctuation, tabs, and newlines.
  • lower: Boolean. Whether to convert the text to lowercase.
  • split: String. The separator used to split the text into words.

Returns: a list of words (or tokens).
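The behaviour described here can be approximated in a few lines of plain Python. This is a sketch under assumptions, not the Keras source: to_word_sequence is a hypothetical name, and the FILTERS string is the documented default written as a Python literal.

```python
# Default filter characters as listed in the Keras documentation (assumed here).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def to_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase the text, replace filtered characters by the separator, split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    # Drop the empty strings left behind by consecutive separators.
    return [w for w in text.translate(table).split(split) if w]


tokens = to_word_sequence("Hello, world! Don't panic.")
# tokens is ['hello', 'world', "don't", 'panic']

cased = to_word_sequence("A-B", lower=False)
# cased is ['A', 'B']
```

The apostrophe survives because it is not part of the default filter string, which is why "don't" stays a single token.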

Reference: Keras Chinese documentation


