This article uses the BeautifulSoup4 library to grab the latest Python questions on StackOverflow and store them in a JSON file. The first half exercises BeautifulSoup by extracting multiple fields from each question; the second half introduces the json module.

For a comprehensive summary of BeautifulSoup usage, see the BeautifulSoup article listed in the references below.

The crawler code

import requests
from bs4 import BeautifulSoup
import re
import json

class Stack(object):
    def __init__(self):
        self.baseurl = 'https://stackoverflow.com'  # used to join relative urls extracted from the page
        self.starturl = 'https://stackoverflow.com/questions/tagged/python'  # initial url

    def start_requests(self, url):
        r = requests.get(url)
        return r.content

    def parse(self, text):
        soup = BeautifulSoup(text, 'html.parser')
        divs = soup.find_all('div', class_='question-summary')
        for div in divs:
            # badge spans are absent for some users, hence the "if ... else 0" below
            gold = div.find('span', title=re.compile('gold'))
            silver = div.find('span', title=re.compile('silver'))
            bronze = div.find('span', title=re.compile('bronze'))
            tags = div.find('div', class_='summary').find_all('div')[1].find_all('a')
            yield {
                'url': self.baseurl + div.h3.a.get('href'),
                'answer': div.find('div', class_=re.compile('status')).strong.text,
                'view': div.find('div', class_='views ').text[:-7].strip(),
                'vote': div.find('span', class_='vote-count-post ').strong.text,
                'time': div.find('div', class_='user-action-time').span.get('title'),
                'duration': div.find('div', class_='user-action-time').span.text,
                'username': div.find('div', class_='user-details').a.text,
                'userurl': self.baseurl + div.find('div', class_='user-gravatar32').a.get('href'),
                'reputation': div.find('span', class_='reputation-score').text,
                'gold': gold.find('span', class_='badgecount').text if gold else 0,
                'silver': silver.find('span', class_='badgecount').text if silver else 0,
                'bronze': bronze.find('span', class_='badgecount').text if bronze else 0,
                'tagnames': [tag.text for tag in tags],
                'tagurls': [self.baseurl + tag.get('href') for tag in tags]
            }

    def start(self):
        text = self.start_requests(self.starturl)
        items = self.parse(text)
        s = json.dumps(list(items), indent=4, ensure_ascii=False)
        with open('stackoverflow.json', 'w', encoding='utf-8') as f:
            f.write(s)

stack = Stack()
stack.start()
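The script above grabs only the first page of results. If more pages are needed, the same class can be reused in a loop; the sketch below assumes StackOverflow's tag pages accept a page query parameter (the three-page limit is an arbitrary choice):

import json

stack = Stack()
all_items = []
for page in range(1, 4):  # arbitrary page limit
    # assumed URL scheme: .../questions/tagged/python?page=N
    url = stack.starturl + '?page={}'.format(page)
    text = stack.start_requests(url)
    all_items.extend(stack.parse(text))

with open('stackoverflow.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(all_items, indent=4, ensure_ascii=False))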

The scraped results are shown in the figure below


The basics of the above code have been covered in previous articles; if anything is unclear, refer to the following:

  • Basic principles of crawlers
  • BeautifulSoup
  • Use of classes and generators

Introduction to the json module

json is a built-in module, so there is nothing to install. It is mainly used through two functions, json.dumps and json.loads:

  • The former converts a Python object such as a list or dict into a string that prints the same way. This conversion is typically used when storing objects in a JSON file: the file contents look like the list/dict object, but writing to a file requires a string (or bytes).
  • The latter converts a string in list/dict form back into a Python object. This is what you do with the string read from a JSON file, turning it into a list/dict that Python can work with.

Sample code is shown below

import json

a = [{'name': 'Bob', 'age': 20}, {'name': 'Mary', 'age': 18}]

s = json.dumps(a)  # s is a string
# s: '[{"age": 20, "name": "Bob"}, {"age": 18, "name": "Mary"}]'

b = json.loads(s)
b[0]             # {'age': 20, 'name': 'Bob'}
b[0].get('age')  # 20
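One caveat: json.loads parses strict JSON, not Python literals, so keys and strings must be double-quoted, as in the output of json.dumps. A quick illustration:

import json

json.loads('[{"name": "Bob"}]')    # ok: valid JSON
# json.loads("[{'name': 'Bob'}]")  # raises json.decoder.JSONDecodeError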

When saving to a file, two parameters are usually added, one to make the output more readable and one to avoid encoding problems:

Save to file

s = json.dumps(a, indent=4, ensure_ascii=False)
with open('a.json', 'w', encoding='utf-8') as f:
    f.write(s)

The indent parameter adds indentation to the output; without it, everything written to the file ends up crammed on a single line.
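A quick before-and-after comparison (the data is just illustrative):

import json

a = [{'name': 'Bob', 'age': 20}]
print(json.dumps(a))
# [{"name": "Bob", "age": 20}]
print(json.dumps(a, indent=4))
# [
#     {
#         "name": "Bob",
#         "age": 20
#     }
# ]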

ensure_ascii=False is specified because the scraped StackOverflow content contains Chinese; without it, those characters would be written as \uXXXX escape sequences instead of readable text.
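The effect is easy to see with a small example (the Chinese name here is just a placeholder):

import json

d = {'name': '张三'}
print(json.dumps(d))                      # {"name": "\u5f20\u4e09"}
print(json.dumps(d, ensure_ascii=False))  # {"name": "张三"}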

Read from a file

with open('a.json', encoding='utf-8') as f:
    s = f.read()
b = json.loads(s)
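As a side note, the json module also provides json.dump and json.load, which read and write file objects directly, so the two snippets above can skip the intermediate string:

import json

# writing: equivalent to json.dumps(...) followed by f.write(s)
with open('a.json', 'w', encoding='utf-8') as f:
    json.dump(a, f, indent=4, ensure_ascii=False)

# reading: equivalent to f.read() followed by json.loads(s)
with open('a.json', encoding='utf-8') as f:
    b = json.load(f)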

Column information

Column home: Programming in Python

Table of contents: table of contents

Crawler directory: Crawler directory

Version description: Software and package version description