This is the 17th day of my participation in the August Text Challenge.

Crawl target

Website: Baidu Library

Tools used

Development tool: PyCharm; environment: Python 3.7 on Windows 10; toolkits: requests, re, json

Key learning content

  • Fetching data from a URL
  • Extracting data with regular expressions
  • Saving text data

Project idea analysis

This article mainly introduces how to deal with the copy restriction on Baidu Library documents.

Before building a crawler project, we should first find out where the data comes from and how it is loaded. For data that is loaded dynamically on the current web page, we need to locate it by capturing the network packets.

The data comes from a JSON file, with the text stored in the c field. After finding the target data, we need to figure out how the resource address of this data is loaded, that is, where the data is requested from.
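To make the structure concrete, here is a minimal sketch of pulling the text out of one such fragment. The body list and the c field match what the packet capture shows; the sample payload itself is made up for illustration:

import json

# Hypothetical fragment payload; the real one comes from the captured request
fragment = '{"body": [{"c": "first line "}, {"c": "second line "}]}'

data = json.loads(fragment)
# Each element of "body" carries a piece of the visible text in its "c" field
text = ''.join(item['c'] for item in data['body'])
print(text)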

We can find the source of the data by searching for keywords. The data turns out to be loaded by the front-end page, so all the fragment download addresses have to be extracted from the article page itself.

We send a network request to the article page and extract all the data download addresses with a regular expression:

def get_url(self):
    url = "https://wenku.baidu.com/view/d19a6bf4876fb84ae45c3b3567ec102de3bddf82.html"
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'wenku.baidu.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # The page embeds a JSON array of per-page resource addresses after "json":
    json_data = re.findall('"json":(.*?}])', response.text)[0]
    json_data = json.loads(json_data)
    for index, page_load_urls in enumerate(json_data):
        # Each entry carries the download address of one page fragment
        page_load_url = page_load_urls['pageLoadUrl']
        self.get_data(index, page_load_url)

Next, we take the index of each fragment together with its source address and send a request to each fragment URL to obtain the corresponding JSON data. The response is JSONP: the JSON is wrapped in a callback such as wenku_1, so we first extract the JSON body with a regular expression.
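Here is a minimal sketch of stripping that JSONP wrapper, assuming each fragment response has the form wenku_N(...) with N being the 1-based page number (the sample response string below is made up for illustration):

import json
import re

# Hypothetical JSONP response for the first fragment
raw = 'wenku_1({"body": [{"c": "some text"}], "page": {"page": 1}})'

index = 0  # 0-based index from enumerate()
callback = 'wenku_' + str(index + 1)
# Capture everything between the callback's parentheses, then parse it as JSON
payload = re.findall(re.escape(callback) + r'\((.*)\)', raw)[0]
data = json.loads(payload)
print(data['body'][0]['c'])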

After that, the formatting can be adjusted and the data saved to a text file.
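As a small illustration of this step, the fragments are joined, spaces are turned into line breaks (as in the code below), and the text is appended to a file; the file name result.txt is a placeholder:

# Text fragments collected from the "c" fields above (sample values)
result = ['first paragraph ', 'second paragraph ']

text = ''.join(result).replace(' ', '\n')
with open('result.txt', 'a', encoding='utf-8') as f:
    f.write(text)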

Full source code

import requests
import re
import json


class WenKu():
    def __init__(self):
        self.session = requests.Session()

    def get_url(self):
        url = "https://wenku.baidu.com/view/23de0cea793e0912a21614791711cc7930b778d4.html"
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wenku.baidu.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        # Extract the JSON array of page-fragment addresses embedded in the page
        json_data = re.findall('"json":(.*?}])', response.text)[0]
        json_data = json.loads(json_data)
        for index, page_load_urls in enumerate(json_data):
            page_load_url = page_load_urls['pageLoadUrl']
            self.get_data(index, page_load_url)

    def get_data(self, index, url):
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wkbjcloudbos.bdimg.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        # The fragment is JSONP; decode the unicode escapes first
        data = response.content.decode('unicode_escape')
        command = 'wenku_' + str(index + 1)
        # Strip the wenku_N(...) callback wrapper to get the raw JSON
        json_data = re.findall(command + r"\((.*?}})\)", data)[0]
        json_data = json.loads(json_data)
        result = []
        for i in json_data['body']:
            # The visible text of each element lives in the "c" field
            data = i["c"]
            result.append(data)
        print(''.join(result).replace(' ', '\n'))
        # The output file name was garbled in the source; "result.txt" is a placeholder
        with open('result.txt', 'a', encoding='utf-8') as f:
            f.write(''.join(result).replace(' ', '\n'))


if __name__ == '__main__':
    wk = WenKu()
    wk.get_url()

I am **White and White i**, a programmer who loves sharing knowledge ❤️

If you don't know how to program or want to learn, feel free to leave a message on this blog. Thank you very much for your likes, favorites, and comments.