5.1 File storage

Data can be saved to files in many forms, such as plain TXT text, JSON, and CSV. This section looks at how to store data in text files.

Saving data as TXT text is very simple, and TXT files are compatible with almost every platform. The drawback is that the data is hard to search. If you do not need to query the data or preserve much structure, and convenience comes first, TXT storage is a reasonable choice. In this section, we’ll look at how to save TXT files using Python.

1. Objectives of this section

In this section, we will scrape the “Hot Topics” part of the “Discovery” page on Zhihu and save the questions and answers to a single text file.

2. Basic examples

First, retrieve the page source with Requests, parse it with the PyQuery parsing library, and then save the extracted question, author, and answer to a text file as follows:

import requests
from pyquery import PyQuery as pq

url = 'https://www.zhihu.com/explore'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
html = requests.get(url, headers=headers).text
doc = pq(html)
items = doc('.explore-tab .feed-item').items()
for item in items:
    question = item.find('h2').text()
    author = item.find('.author-link-line').text()
    answer = pq(item.find('.content').html()).text()
    # Open in append mode so each entry is added to the end of the file
    file = open('explore.txt', 'a', encoding='utf-8')
    file.write('\n'.join([question, author, answer]))
    file.write('\n' + '=' * 50 + '\n')
    file.close()

The main purpose here is to demonstrate how files are saved, so exception handling for Requests is omitted. First, Requests fetches the Discovery page of Zhihu, and the question, author, and full answer text of each hot topic are extracted. Python’s open() function then opens a text file and returns a file object, which is assigned to file. The extracted content is written to the file with the file object’s write() method. Finally, close() is called to close the file, so that the captured content is successfully written to the text file.

Run the program and you will find that an explore.txt file is generated locally, as shown in Figure 5-1.

Figure 5-1 File content

In this way, the content of the popular questions and answers is saved as text.

Here, the first argument to open() is the name of the target file, and the second argument, a, opens the file in append mode. In addition, we specify utf-8 as the file encoding. After writing, you need to call close() to close the file object.
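
As a small illustration of these arguments, the sketch below appends two made-up records to a file and reads them back. The sample records and the temporary directory are assumptions for the example, not data from Zhihu.

```python
import os
import tempfile

# Hypothetical records standing in for scraped (question, author, answer) data.
records = [
    ('What is Python?', 'Alice', 'A programming language.'),
    ('什么是爬虫？', 'Bob', 'A program that fetches web pages.'),
]

# Write into a temporary directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'explore.txt')

for question, author, answer in records:
    # 'a' appends to the end of the file; the file is created on first use.
    file = open(path, 'a', encoding='utf-8')
    file.write('\n'.join([question, author, answer]))
    file.write('\n' + '=' * 50 + '\n')
    file.close()

# Read the file back with the same encoding to check that both records are there.
file = open(path, 'r', encoding='utf-8')
content = file.read()
file.close()

separator_count = content.count('=' * 50)
print(separator_count)
```

Because the file is reopened in append mode for each record, the second write lands after the first instead of replacing it. Passing encoding='utf-8' on both the write and the read keeps non-ASCII text such as the Chinese question intact regardless of the platform default.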

3. Opening modes

In the example above, the second argument to open() is set to a, so the existing file is not emptied each time text is written; new content is added at the end of the file instead. This argument is the file opening mode. There are several other modes, which are briefly introduced here.

  • r: Opens a file in read-only mode. The file pointer is placed at the beginning of the file. This is the default mode.
  • rb: Opens a file in binary read-only mode. The file pointer is placed at the beginning of the file.
  • r+: Opens a file in read-write mode. The file pointer is placed at the beginning of the file.
  • rb+: Opens a file in binary read-write mode. The file pointer is placed at the beginning of the file.
  • w: Opens a file in write mode. If the file already exists, its contents are overwritten. If the file does not exist, a new file is created.
  • wb: Opens a file in binary write mode. If the file already exists, its contents are overwritten. If the file does not exist, a new file is created.
  • w+: Opens a file in read-write mode. If the file already exists, its contents are overwritten. If the file does not exist, a new file is created.
  • wb+: Opens a file in binary read-write mode. If the file already exists, its contents are overwritten. If the file does not exist, a new file is created.
  • a: Opens a file in append mode. If the file already exists, the file pointer is placed at the end of the file, so new content is written after the existing content. If the file does not exist, a new file is created for writing.
  • ab: Opens a file in binary append mode. If the file already exists, the file pointer is placed at the end of the file, so new content is written after the existing content. If the file does not exist, a new file is created for writing.
  • a+: Opens a file for reading and appending. If the file already exists, the file pointer is placed at the end of the file and writes are appended. If the file does not exist, a new file is created for reading and writing.
  • ab+: Opens a file for reading and appending in binary mode. If the file already exists, the file pointer is placed at the end of the file. If the file does not exist, a new file is created for reading and writing.
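
The difference between the most common modes can be checked with a short experiment. The following is a minimal sketch, assuming a throwaway file named modes.txt in a temporary directory; it uses the with statement (covered in the next part) only to keep the code short.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'w' creates the file, or empties it if it already exists, before writing.
with open(path, 'w', encoding='utf-8') as f:
    f.write('first\n')

# 'a' keeps the existing content and writes at the end.
with open(path, 'a', encoding='utf-8') as f:
    f.write('second\n')

# 'r' reads from the beginning of the file.
with open(path, 'r', encoding='utf-8') as f:
    after_append = f.read()

# Opening with 'w' again discards everything written so far.
with open(path, 'w', encoding='utf-8') as f:
    f.write('third\n')

with open(path, 'r', encoding='utf-8') as f:
    after_overwrite = f.read()

# 'rb' returns raw bytes instead of decoded text.
with open(path, 'rb') as f:
    raw = f.read()

print(repr(after_append))
print(repr(after_overwrite))
print(type(raw))
```

After the append step the file holds both lines, while the second w open leaves only the last one. The binary read returns a bytes object, which is why the b modes suit non-text data such as images.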

4. Simplified syntax

A more concise way to write to files is the with as syntax. When the with block ends, the file is closed automatically, so you no longer need to call close(). The file-writing code above can be shortened to:

with open('explore.txt', 'a', encoding='utf-8') as file:
    file.write('\n'.join([question, author, answer]))
    file.write('\n' + '=' * 50 + '\n')
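
To see the automatic closing in action, the sketch below inspects the file object's closed attribute inside and after the with block, and again after a block that exits through an exception. The temporary path and the simulated error are assumptions made for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'explore.txt')

with open(path, 'a', encoding='utf-8') as file:
    file.write('inside the with block\n')
    closed_inside = file.closed  # still open while the block is running

closed_after = file.closed  # closed as soon as the block ends

# The file is also closed when the block is left through an exception.
try:
    with open(path, 'a', encoding='utf-8') as file:
        raise ValueError('simulated failure while writing')
except ValueError:
    pass

closed_after_error = file.closed

print(closed_inside, closed_after, closed_after_error)
```

This is the practical advantage over calling close() by hand: with the explicit open() and close() pair, an exception raised between the two calls would skip close() unless the code is wrapped in try/finally.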

If you want to clear the existing text when saving, change the second argument to w as follows:

with open('explore.txt', 'w', encoding='utf-8') as file:
    file.write('\n'.join([question, author, answer]))
    file.write('\n' + '=' * 50 + '\n')

This approach is simple to use and efficient, and it is one of the most basic ways to save data.
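
One possible way to package the pattern for reuse is sketched below. The helper names save_records and load_records, the SEPARATOR constant, and the sample data are hypothetical and not part of the original example. The sketch opens the file once for a whole batch rather than once per record, and shows that reading the data back means splitting the text on the divider line by hand.

```python
import os
import tempfile

SEPARATOR = '=' * 50  # same divider line as in the examples above


def save_records(records, path, mode='a'):
    """Write (question, author, answer) tuples to path, one block per record."""
    # Open the file once for the whole batch instead of once per record.
    with open(path, mode, encoding='utf-8') as file:
        for question, author, answer in records:
            file.write('\n'.join([question, author, answer]))
            file.write('\n' + SEPARATOR + '\n')


def load_records(path):
    """Read the file back and split it into blocks on the divider line."""
    with open(path, 'r', encoding='utf-8') as file:
        blocks = file.read().split('\n' + SEPARATOR + '\n')
    # The text after the last divider is empty, so drop blank blocks.
    return [block for block in blocks if block.strip()]


# Usage with made-up records and a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'explore.txt')
sample = [
    ('Q1', 'author1', 'answer one'),
    ('Q2', 'author2', 'answer two'),
]
save_records(sample, path, mode='w')
blocks = load_records(path)
print(len(blocks))
print(blocks[0].split('\n'))
```

The need for load_records illustrates the retrieval drawback mentioned at the start of the section: a TXT file carries no structure of its own, so any lookup depends on conventions such as the divider line, and an answer that itself contains line breaks cannot be reliably separated from the question and author.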


This article was first published on Cui Qingcai's personal blog, Jingmi: Python3 web crawler development in practice tutorial | Jingmi
