
1. Why preprocessing?

Natural Language Processing (NLP) is the field of study concerned with enabling computers to understand human language.

To study language processing, you first have to have language text. In a previous article, we talked about serializing fixed-format text with the Tokenizer.

Before serialization, the texts first have to be brought into a consistent format. This basic step is called data preprocessing.

2. Common carriers of text storage

In general, directly readable text data is stored in one of a few kinds of files:

  • Databases: SQLite, MySQL…
  • Table files: CSV, Excel…
  • Plain text files: TXT, JSON…

Let’s go through how to read and write these files one by one.

2.1 Databases

Databases store text and offer several advantages:

  • Support for large amounts of data
  • Efficient queries
  • Support for complex relationships

Let's take an SQLite database as an example and see how to read the data.

Here is a database file with a size of 8KB.

There is a table named “CI” in the database. The data structure and contents are as follows:

There are 25 rows in the table, one per ci poem; each row has four fields: value (a numeric id), rhythmic (the tune name), author, and content.
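If you want to check the table structure yourself, here is a minimal sketch that asks SQLite for the stored CREATE statement (treat the exact column names above as reported by the table, not guaranteed):

import sqlite3

conn = sqlite3.connect('./data/data.db')
# sqlite_master stores the original CREATE statement for each table
cursor = conn.execute("SELECT sql FROM sqlite_master WHERE name = 'ci';")
print(cursor.fetchone()[0])
cursor.close()
conn.close()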

Suppose we want to use the content of each poem as a training set. How do we organize the data?

# Import SQLite support
import sqlite3

# List to hold the content of each poem
str_array = []

# Set up a connection by specifying the location of the file (data.db in the data directory)
conn = sqlite3.connect('./data/data.db')
# Query the author and content columns of the ci table
cursor = conn.execute("SELECT author, content from ci;")
# Loop over the results
for row in cursor:
    # Take the field with index 1 (0 is author, 1 is content)
    ci = row[1]
    # Add it to the content list
    str_array.append(ci)

# Close the cursor and the connection
cursor.close()
conn.close()

# Print the result
print(str_array)

The final output is as follows:

['…first poem…\r\n…second line…\r\n…', '…second poem…\r\n…', …]

Here \r\n is a carriage-return/line-feed pair. Displayed in a text box, a string such as "We look lofty and sky high.\r\nThings light will be heavy, thousands of miles to send goose feather." renders the \r\n sequences as real line breaks, which makes the structure of each poem easier to see.
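For a quick check in Python itself (assuming the str_array built above), printing a single element renders those sequences as line breaks:

# Each \r\n in the string becomes a line break on screen
print(str_array[0])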

Extended knowledge: writing with sqlite3

Writing data is simple and similar to reading it: first establish the connection, then execute the SQL statement, then commit, and finally close the connection.

As an example, let's insert two rows in one go.

# Import SQLite support
import sqlite3

# %% Write rows to the table
conn = sqlite3.connect('./data/data.db')
for t in [(9998, "Tune name 1", "Author 1", "Text 1"),
          (9999, "Tune name 2", "Author 2", "Text 2")]:
    conn.execute("insert into ci values (?,?,?,?)", t)

conn.commit()
conn.close()
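As a side note, sqlite3 also provides executemany(), which runs the same statement once per tuple; a minimal equivalent sketch of the insert above:

import sqlite3

conn = sqlite3.connect('./data/data.db')
# Run the same insert once for each tuple in the list
conn.executemany("insert into ci values (?,?,?,?)",
                 [(9998, "Tune name 1", "Author 1", "Text 1"),
                  (9999, "Tune name 2", "Author 2", "Text 2")])
conn.commit()
conn.close()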

2.2 Table Files

Compared with databases, table files (CSV, Excel) are another good way to store text.

You can open one with a double click and edit its contents directly, and the spreadsheet application's built-in tools can handle some of the data processing.

Here is a CSV file with many rows. Each row is a ci poem; the three columns are the tune name, the author, and the content.

Again, suppose we want to use the content of each poem as a training set. How do we organize the data?

import csv

# List to store the contents
str_array = []

# Build the reader, specifying the file location (data.csv in the data directory) and the encoding
csv_reader = csv.reader(open("./data/data.csv", encoding="gbk"))
# Loop over each row
for row in csv_reader:
    # Take the third column (index 2) and store it in the list
    str_array.append(row[2])

# Print the data
print(str_array)

The final output is as follows:

['…first poem…\r\n…second line…\r\n…', '…second poem…\r\n…', …]

This way, the array data is ready to use.
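Excel files were mentioned alongside CSV; a library such as pandas can build the same array from one. A minimal sketch, assuming pandas (plus an Excel engine such as openpyxl) is installed, and using a hypothetical data.xlsx with the same three columns:

import pandas as pd

# header=None: the file has no header row, so columns are numbered 0, 1, 2
df = pd.read_excel('./data/data.xlsx', header=None)
# Take the third column (index 2), as with the CSV
str_array = df[2].tolist()
print(str_array)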

Extended knowledge: CSV writing

Writing data is similar to reading data. You build a writer, write data, and close the open file.

For example, let's create a CSV file and insert one row.

import csv

# Open (create) the file for writing, specifying the encoding
f_csv = open('./data/data2.csv', 'w', encoding='gbk', newline='')
# Get a writer for this file
csv_writer = csv.writer(f_csv)
# Write one row of data
csv_writer.writerow(['First column', 'Second column', 'Third column'])

# Close the file
f_csv.close()

After the code runs, a new data2.csv file is created in the data directory, containing one row with three columns.
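To write several rows at once, csv.writer also provides writerows(); a minimal sketch, using a hypothetical data3.csv:

import csv

rows = [['a1', 'a2', 'a3'],
        ['b1', 'b2', 'b3']]
# newline='' prevents the csv module from writing blank lines on Windows
with open('./data/data3.csv', 'w', encoding='gbk', newline='') as f_csv:
    csv.writer(f_csv).writerows(rows)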

2.3 Text Files

A plain text file (TXT) is the most lightweight way to store text.

It has none of the relational structure of a database or a table file; it just holds flat text, and it does not scale well: tens of thousands of lines quickly become hard to work with.

But it also has one big advantage: it is easy to use.

Just open the file and type in the characters.

Because there is no built-in concept of rows and columns, most text files separate records with special characters, typically carriage-return/line-feed pairs: one line is one record.

Here is a piece of text, let’s see how to read it.

# List to hold each line of text
str_array = []

# Open data.txt (in the data directory) in binary read mode
f_read = open('./data/data.txt', 'rb')
# Read one line of the file
line = f_read.readline()
# While there is a line left
while line:
    # Decode the bytes; utf-8-sig also strips a possible BOM
    wstr = str(line, encoding="utf-8-sig")
    # Add it to the list
    str_array.append(wstr)
    # Read the next line
    line = f_read.readline()

# Close the file
f_read.close()

# Print the list
print(str_array)

The final output is as follows:

['…first line…\r\n', '…second line…\r\n', …]

Extended knowledge: writing TXT

Writing data is similar to reading data. First open a file, write data, and finally close the open file.

For example, create a TXT file and write text.

# Open (create) data2.txt for writing
f_write = open('./data/data2.txt', 'w')
# Write two lines of text
f_write.write('First line of text\nSecond line')

f_write.close()
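A common alternative is a with block, which closes the file automatically even if an error occurs; a minimal sketch of the same write:

# The file is closed automatically when the block ends
with open('./data/data2.txt', 'w') as f_write:
    f_write.write('First line of text\nSecond line')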

3. Putting it together: clean the data and save it as a JSON file

Suppose we want to train on a set of song ci data, and the source is the database table below.

The data may seem neat, but it’s not perfect.

At this point we have three requirements:

  1. Eliminate redundancy: strip extraneous whitespace and remove duplicate entries.
  2. Filter: keep only poems that match the "Linxian River" sentence pattern.
  3. Change storage: the amount of data is small, so store the cleaned result as JSON text.

Analysis:

  1. The strip() method removes leading and trailing spaces and newlines from text. Duplicates can be removed with a little logic: keep every sentence already saved in a collection, and before saving a new sentence, look it up there. If it is found, it is a duplicate; if not, this is its first occurrence.

  2. The "Linxian River" format is {[7 Chinese characters]. <CRLF>[5 Chinese characters], [5 Chinese characters]. }, which can be filtered with a regular expression: ^[\u4e00-\u9fa5]{7}.\r\n[\u4e00-\u9fa5]{5},[\u4e00-\u9fa5]{5}.$ (a quick sanity check of this pattern follows the list).

  3. Read the data from the database, filter for matching text, and write the result out as a JSON string.
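Before running the full pipeline, the regular expression can be sanity-checked on a synthetic string. A minimal sketch; note the punctuation here is the ASCII '.' and ',' from the pattern above, so if the real data uses full-width '，' the literal comma would need adjusting:

import re

pattern = re.compile(r'^[\u4e00-\u9fa5]{7}.\r\n[\u4e00-\u9fa5]{5},[\u4e00-\u9fa5]{5}.$')
# Seven Chinese characters, a period, CRLF, five characters, a comma, five characters, a period
sample = '一' * 7 + '.\r\n' + '二' * 5 + ',' + '三' * 5 + '.'
print(bool(pattern.match(sample)))  # True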

The code is as follows:

import sqlite3
import json
import re

# List to store the content
str_array = []
# Content already encountered (for duplicate detection)
keys = {""}

# Regular expression for the {7. 5, 5. } format
pattern = re.compile(r'^[\u4e00-\u9fa5]{7}.\r\n[\u4e00-\u9fa5]{5},[\u4e00-\u9fa5]{5}.$')

# Connect to the database
conn = sqlite3.connect('./data/data.db')
# Query only the content field
cursor = conn.execute("SELECT content from ci;")
# Loop over the results
for row in cursor:
    # Take the column with index 0
    ci = row[0]
    # Strip leading and trailing whitespace
    ci = ci.strip()
    # Match against the format
    m = pattern.match(ci)
    # No match found
    if m is None:
        print('\nNo match: ', ci)
    else:  # matched
        print('\nMatch succeeded: ', ci)
        # Already seen?
        if ci in keys:
            # Duplicate, skip it
            print('\nAlready exists ->', ci)
        else:
            # First occurrence: record it and add it to the content list
            keys.add(ci)
            str_array.append(ci)

# Close the cursor and the connection
cursor.close()
conn.close()

# Convert the content list to JSON
j_str = json.dumps(str_array, indent=2, ensure_ascii=False)
# Open (create) the output file
f_write = open('./data/data2.json', 'w')
# Write the JSON text
f_write.write(j_str)
# Close the file
f_write.close()

The generated data2.json looks like this:
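To confirm the file is well-formed, you can load it back; a minimal check:

import json

# Load the cleaned data back from the JSON file
with open('./data/data2.json') as f_read:
    data = json.load(f_read)
print(len(data), data[:1])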