This article uses the BeautifulSoup4 library to grab the latest Python questions on StackOverflow and store them in a JSON file. The first half exercises BeautifulSoup by extracting multiple fields from each question; the second half introduces the json module.
This article serves as a comprehensive summary of BeautifulSoup usage.
The crawler code
import requests
from bs4 import BeautifulSoup
import re
import json

class Stack(object):

    def __init__(self):
        self.baseurl = 'https://stackoverflow.com'  # base url used to build absolute links
        self.starturl = 'https://stackoverflow.com/questions/tagged/python'  # initial url

    def start_requests(self, url):
        r = requests.get(url)
        return r.content

    def parse(self, text):
        soup = BeautifulSoup(text, 'html.parser')
        divs = soup.find_all('div', class_='question-summary')
        for div in divs:
            # badge spans are absent when the user has no badges of that kind
            gold = div.find('span', title=re.compile('gold'))
            silver = div.find('span', title=re.compile('silver'))
            bronze = div.find('span', title=re.compile('bronze'))
            tags = div.find('div', class_='summary').find_all('div')[1].find_all('a')
            yield {
                'title': div.h3.a.text,
                'url': self.baseurl + div.h3.a.get('href'),
                'answer': div.find('div', class_=re.compile('status')).strong.text,
                'view': div.find('div', class_='views ').text[:-7].strip(),
                'vote': div.find('span', class_='vote-count-post ').strong.text,
                'time': div.find('div', class_='user-action-time').span.get('title'),
                'duration': div.find('div', class_='user-action-time').span.text,
                'username': div.find('div', class_='user-details').a.text,
                'userurl': self.baseurl + div.find('div', class_='user-gravatar32').a.get('href'),
                'reputation': div.find('span', class_='reputation-score').text,
                'gold': gold.find('span', class_='badgecount').text if gold else 0,
                'silver': silver.find('span', class_='badgecount').text if silver else 0,
                'bronze': bronze.find('span', class_='badgecount').text if bronze else 0,
                'tagnames': [tag.text for tag in tags],
                'tagurls': [self.baseurl + tag.get('href') for tag in tags]
            }

    def start(self):
        text = self.start_requests(self.starturl)
        items = self.parse(text)
        s = json.dumps(list(items), indent=4, ensure_ascii=False)
        with open('stackoverflow.json', 'w', encoding='utf-8') as f:
            f.write(s)

stack = Stack()
stack.start()
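Note that parse is a generator: it yields one dict per question summary, and list(items) in start consumes it. Below is a minimal sketch of the two BeautifulSoup techniques the crawler relies on, find_all with class_ and find with a re.compile attribute match. The HTML fragment and field values are made up for illustration only:

from bs4 import BeautifulSoup
import re

# Hypothetical fragment mimicking one question summary on the page
html = '''
<div class="question-summary">
    <h3><a href="/questions/1">How do I parse HTML?</a></h3>
    <span title="12 gold badges"><span class="badgecount">12</span></span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', class_='question-summary'):  # filter by CSS class
    gold = div.find('span', title=re.compile('gold'))        # regex match on an attribute
    print(div.h3.a.text)                                     # How do I parse HTML?
    print(gold.find('span', class_='badgecount').text if gold else 0)  # 12

The `if gold else 0` guard mirrors the crawler above: when a user has no gold badges there is no matching span, and find returns None.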
The scraped results are shown in the figure below.
The basics of the above code have been covered in previous articles, so please refer to the following articles if you have any questions:
- Basic principles of crawlers
- BeautifulSoup
- Use of classes and generators
Introduction to the json module
json is a built-in module, so it does not need to be installed separately. The module mainly provides two functions: json.dumps and json.loads.
- The former converts a Python object such as a list or dict into a string that looks the same. This conversion is generally used for storing objects in JSON files: a JSON file has the same form as a list/dict object, but writing to a file requires a string (or bytes).
- The latter converts a list/dict-like string into a Python object. When you read a JSON file you get a string, and json.loads turns it into a list or dict that Python can work with.
Sample code is shown below
import json

a = [{'name': 'Bob', 'age': 20}, {'name': 'Mary', 'age': 18}]

s = json.dumps(a)  # s is a string
s  # '[{"age": 20, "name": "Bob"}, {"age": 18, "name": "Mary"}]'

b = json.loads(s)
b[0]  # {'age': 20, 'name': 'Bob'}
b[0].get('age')  # 20
When storing to a file, two parameters are generally added to make the string display better and to avoid encoding problems, as follows.
Save to file
s = json.dumps(a, indent=4, ensure_ascii=False)
with open('a.json', 'w', encoding='utf-8') as f:
    f.write(s)
The indent parameter specifies the indentation width; without it, everything written to the file ends up crammed on a single line.
ensure_ascii=False is needed when the content contains non-ASCII characters such as Chinese (as the stackoverflow results here do); otherwise they are written as \uXXXX escape sequences.
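A minimal sketch of what the two parameters change (the sample data is made up; the commented strings are the values json.dumps returns):

import json

data = [{'name': '鲍勃', 'age': 20}]

json.dumps(data)
# '[{"name": "\u9c8d\u52c3", "age": 20}]'  -- one line, Chinese escaped to \uXXXX

json.dumps(data, indent=4, ensure_ascii=False)
# '[\n    {\n        "name": "鲍勃",\n        "age": 20\n    }\n]'  -- indented, Chinese kept readable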
Read from a file
with open('a.json', encoding='utf-8') as f:
    s = f.read()
b = json.loads(s)
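As a side note, json.dump and json.load do the same job but work on file objects directly, skipping the intermediate string; a minimal equivalent of the two snippets above:

import json

a = [{'name': 'Bob', 'age': 20}, {'name': 'Mary', 'age': 18}]

# Write: json.dump serializes straight into the file object
with open('a.json', 'w', encoding='utf-8') as f:
    json.dump(a, f, indent=4, ensure_ascii=False)

# Read: json.load parses straight from the file object
with open('a.json', encoding='utf-8') as f:
    b = json.load(f)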