This article was first published on Zhihu
Most of a crawler's running time is spent blocked on I/O while waiting for web pages. Enabling multi-threading, so that different requests can wait at the same time, therefore greatly improves the crawler's efficiency.
This article uses multi-threading (10 threads here) and GitHub's API to grab the information of all 5,000+ forks of the cpython project and store the data in a JSON file.
The content is grabbed from GitHub, building directly on the non-multithreaded version shown in the previous article.
Technologies required for the crawler
- The requests library to request web pages, fetch JSON data, parse the dictionaries to extract the information we need, and store it in a JSON file
- threading to design the multi-threaded part of the web requests
- A queue to store the URLs to be crawled and a list to collect the parsed results (a minimal sketch of this pattern follows this list)
- A GitHub account; you need to enter your email address and password in the code
- An understanding of decorators (only used here to measure running time, so don't worry if you're not familiar with them)
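Before reading the full crawler, a minimal sketch of the core pattern may help: a few threads keep pulling URLs out of a shared queue.Queue until it is empty. The worker function, the example.com URLs and the sleep standing in for a network request are all made up for illustration.

import time
from queue import Queue
from threading import Thread

def worker(q, results):
    # keep taking URLs until the queue is empty, then let the thread exit
    while not q.empty():
        url = q.get()
        time.sleep(1)          # stand-in for a blocking network request
        results.append(url)

q = Queue()
for i in range(1, 11):         # ten made-up URLs
    q.put('https://example.com/page/{}'.format(i))

results = []
start = time.time()
threads = [Thread(target=worker, args=(q, results)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), 'pages in', round(time.time() - start, 2), 's')

With 5 threads, the ten 1-second "requests" finish in about 2 seconds instead of 10, which is exactly the effect the crawler below relies on.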
The crawler code is as follows
import requests
import time
from threading import Thread
from queue import Queue
import json


def run_time(func):
    def wrapper(*args, **kw):
        start = time.time()
        func(*args, **kw)
        end = time.time()
        print('running', end-start, 's')
    return wrapper


class Spider():

    def __init__(self):
        self.qurl = Queue()
        self.data = list()
        self.email = ''     # email used to log in to GitHub
        self.password = ''  # password used to log in to GitHub
        self.page_num = 171
        self.thread_num = 10

    def produce_url(self):
        baseurl = 'https://api.github.com/repos/python/cpython/forks?page={}'
        for i in range(1, self.page_num + 1):
            url = baseurl.format(i)
            self.qurl.put(url)  # the URL is queued for retrieval by the worker threads

    def get_info(self):
        while not self.qurl.empty():  # ensure the thread exits once the URLs are exhausted
            url = self.qurl.get()     # get a URL from the queue
            print('crawling', url)
            req = requests.get(url, auth=(self.email, self.password))
            data = req.json()
            for datai in data:
                result = {
                    'project_name': datai['full_name'],
                    'project_url': datai['html_url'],
                    'project_api_url': datai['url'],
                    'star_count': datai['stargazers_count']
                }
                self.data.append(result)

    @run_time
    def run(self):
        self.produce_url()
        ths = []
        for _ in range(self.thread_num):
            th = Thread(target=self.get_info)
            th.start()
            ths.append(th)
        for th in ths:
            th.join()
        s = json.dumps(self.data, ensure_ascii=False, indent=4)
        with open('github_thread.json', 'w', encoding='utf-8') as f:
            f.write(s)
        print('Data crawling is finished.')


if __name__ == '__main__':
    Spider().run()
The reader simply needs to specify his/her GitHub email address and password in the Spider's __init__ to run the crawler.
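If you want to see what one page of the API returns before running the whole crawler, a single hand-written request like the sketch below is enough. The credential placeholders are yours to fill in; the fields printed are the same four the crawler extracts.

import requests

url = 'https://api.github.com/repos/python/cpython/forks?page=1'
resp = requests.get(url, auth=('your_email', 'your_password'))  # fill in your own credentials
for fork in resp.json():
    # the same fields the crawler stores
    print(fork['full_name'], fork['stargazers_count'], fork['html_url'], fork['url'])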
The crawler description is as follows
1. The run_time function is a decorator that measures the program's running time; it is applied to the run method of the Spider object.
2. The Spider class:
- __init__ initializes some constants.
- produce_url generates all the URLs and stores them in the Queue qurl. The 5,000+ forks are spread across 171 pages, so these 171 URLs are queued up waiting to be requested and parsed. The threads do not need to communicate with each other, so a plain list instead of a Queue would also work here.
- get_info requests and parses a web page; once multi-threading is enabled, several copies of this function run at the same time. The logic: as long as qurl still has elements, each iteration takes one URL from qurl, requests and parses it, and stores the result in the data list. When the queue is empty, the loop exits (the crawler is done).
- run drives the crawler. It first calls produce_url to generate the queue of URLs to be crawled, then starts the specified number of threads, each of which keeps taking URLs from qurl, parsing them and storing the results in the data list. Once the URL queue has been worked through, data is written to a JSON file.
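Once the crawler has finished, a few lines like the following can be used to check that github_thread.json holds the expected records. This is just a quick sanity check, not part of the crawler itself; the star_count key comes from the result dictionary built in get_info.

import json

with open('github_thread.json', encoding='utf-8') as f:
    projects = json.load(f)

print(len(projects), 'forks saved')
print(max(projects, key=lambda p: p['star_count']))  # the most-starred fork in the results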
The crawler results
The crawl results are as follows.
With 10 threads, the program grabbed all 171 pages in 33 seconds; the non-multithreaded version in the previous article took 333 seconds. To get a better sense of how much multi-threading helps, you can try modifying self.page_num and self.thread_num in the code above.
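For example, one way to rerun the timing experiment without editing __init__ is to override the two attributes after creating the Spider, as in this short sketch:

if __name__ == '__main__':
    spider = Spider()
    spider.page_num = 20    # fetch only the first 20 pages
    spider.thread_num = 5   # try 2, 5, 10, 20, ...
    spider.run()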
I also ran an experiment with self.page_num set to 20, i.e. 20 pages fetched in total:
- 2 threads: 18.51 seconds
- 5 threads: 7.49 seconds
- 10 threads: 3.97 seconds
- 20 threads: 2.11 seconds
A question
One final question for the reader to ponder: we already implemented multi-threading in an earlier article, so why was the code so simple there and so much more complex here?
Follow-up
The next article in this multi-threaded crawler series will apply multi-threading to paging through results and crawling secondary pages.
Welcome to my Zhihu column
- Column home: Programming in Python
- Table of contents
- Version description: software and package versions used