This article was first published on Zhihu

Most of a crawler's running time is spent blocked on IO while waiting for web pages to be downloaded. Enabling multi-threading, so that several requests can wait at the same time, therefore greatly improves the crawler's efficiency.
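
To see why, consider fetching a handful of pages sequentially versus with threads. The sketch below is only an illustration of the idea (the URL is an arbitrary example, not part of the crawler built in this article): the sequential version takes roughly the sum of all request times, while the threaded version takes roughly the time of the slowest single request.

import time
import requests
from threading import Thread

# An arbitrary, repeated example endpoint; any slow URLs would show the same effect.
urls = ['https://api.github.com/repos/python/cpython'] * 5

def fetch(url):
    requests.get(url)  # the thread spends most of this call waiting on IO

# Sequential: total time is roughly the sum of all request times
start = time.time()
for url in urls:
    fetch(url)
print('sequential:', time.time() - start, 's')

# Threaded: the requests wait concurrently, so total time is roughly one request
start = time.time()
threads = [Thread(target=fetch, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('threaded:', time.time() - start, 's')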

This article uses multi-threading (10 threads are opened here) and GitHub's API to grab information on the 5,000+ projects that fork the cpython project, storing the data in a JSON file.

We grab this content from GitHub, building directly on the non-multithreaded version shown in the last article.

Techniques required for the crawler

  • The Requests library is used to request the web pages and fetch the JSON data; the resulting dictionaries are parsed to extract the information we need, which is then stored as a JSON file (a quick check of this response is sketched right after this list)
  • Multi-threading for the web-request part is built with threading
  • A Queue stores the URLs waiting to be crawled, and a list stores the parsed result data
  • You need a GitHub account, and its email address and password must be entered in the code
  • Understand decorators (the one here only measures running time, so don't worry if you're unfamiliar with them)
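
As a quick, hedged check of what one page of the forks API returns before diving into the full crawler: the fields used below (full_name, html_url, url, stargazers_count) are standard GitHub repository fields, but it is worth verifying the exact shape of the response against the live API. Note that unauthenticated requests are rate-limited by GitHub.

import requests

# Fetch one page of cpython's forks and peek at the fields the crawler will use.
url = 'https://api.github.com/repos/python/cpython/forks?page=1'
forks = requests.get(url).json()  # a list of repository objects

for fork in forks[:3]:
    print(fork['full_name'],         # e.g. "someuser/cpython"
          fork['html_url'],          # web page of the forked repository
          fork['url'],               # API URL of the forked repository
          fork['stargazers_count'])  # number of stars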

The crawler code is as follows

import requests
import time
from threading import Thread
from queue import Queue
import json

def run_time(func):
    def wrapper(*args, **kw):
        start = time.time()
        func(*args, **kw)
        end = time.time()
        print('running', end-start, 's')
    return wrapper


class Spider():

    def __init__(self):
        self.qurl = Queue()
        self.data = list()
        self.email = ' ' # email used to log in to GitHub
        self.password = ' ' # password used to log in to GitHub
        self.page_num = 171
        self.thread_num = 10

    def produce_url(self):
        baseurl = 'https://api.github.com/repos/python/cpython/forks?page={}'
        for i in range(1, self.page_num + 1):
            url = baseurl.format(i)
            self.qurl.put(url)  # the URL is queued for retrieval by the worker threads

    def get_info(self):
        while not self.qurl.empty():  # ensure each thread exits once the URLs are exhausted
            url = self.qurl.get()  # take a URL from the queue
            print('crawling', url)
            req = requests.get(url, auth = (self.email, self.password))
            data = req.json()
            for datai in data:
                result = {
                    'project_name': datai['full_name'],
                    'project_url': datai['html_url'],
                    'project_api_url': datai['url'],
                    'star_count': datai['stargazers_count']
                }
                self.data.append(result)

    @run_time
    def run(self):
        self.produce_url()

        ths = []
        for _ in range(self.thread_num):
            th = Thread(target=self.get_info)
            th.start()
            ths.append(th)
        for th in ths:
            th.join()

        s = json.dumps(self.data, ensure_ascii=False, indent=4)
        with open('github_thread.json', 'w', encoding='utf-8') as f:
            f.write(s)

        print('Data crawling is finished.')

if __name__ == '__main__':
    Spider().run()

To run the crawler, the reader simply needs to fill in his or her GitHub email address and password in the Spider's __init__.
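
One caveat: GitHub no longer accepts account passwords for API requests, so on current GitHub a personal access token has to be passed in place of the password in the same auth tuple. A minimal sketch, with a placeholder token and username:

import requests

# 'ghp_xxxxxxxx' is a placeholder; generate a real personal access token in
# GitHub's developer settings and keep it out of version control.
url = 'https://api.github.com/repos/python/cpython/forks?page=1'
resp = requests.get(url, auth=('your_github_username', 'ghp_xxxxxxxx'))
print(resp.status_code)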

The crawler description is as follows

1. The run_time function is a decorator that measures the program's running time; it is applied to the run method of the Spider object

2. The Spider class

  • __init__ initializes some constants
  • produce_url generates all of the URLs and stores them in the Queue qurl. The 5,000+ forks are spread across 171 pages, so these 171 URLs are queued up for request and parsing. The threads do not need to communicate with each other here, so using a plain list instead of a Queue would also work (a sketch of that variant follows this list)
  • get_info requests and parses one page at a time; once multi-threading is enabled, several copies of this function run simultaneously. The logic: as long as qurl still has elements, take a URL from qurl, request and parse it, and append the result to the data list. The loop exits, and the crawler ends, once the queue is empty
  • run drives the crawler. It first calls produce_url to generate the queue of URLs to crawl, then starts the specified number of threads, each of which keeps pulling URLs from qurl, parsing them, and appending the results to the data list. Once the URL queue has been drained and all threads have finished, data is written to a JSON file
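
As noted above, the worker threads only ever take URLs out of qurl, so the Queue is not strictly required. Purely as a minimal sketch of that variant, relying on the fact that list.pop() is atomic in CPython, the two methods that would change could look roughly like this:

    def produce_url(self):
        baseurl = 'https://api.github.com/repos/python/cpython/forks?page={}'
        # a plain list instead of a Queue; threads only ever pop from it
        self.qurl = [baseurl.format(i) for i in range(1, self.page_num + 1)]

    def get_info(self):
        while self.qurl:
            try:
                url = self.qurl.pop()  # list.pop() is atomic in CPython
            except IndexError:
                # another thread emptied the list between the check and the pop
                break
            print('crawling', url)
            req = requests.get(url, auth=(self.email, self.password))
            for datai in req.json():
                self.data.append({
                    'project_name': datai['full_name'],
                    'project_url': datai['html_url'],
                    'project_api_url': datai['url'],
                    'star_count': datai['stargazers_count']
                })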

The crawler results

The crawl results are as follows

With 10 threads, the program grabbed all 171 pages in 33 seconds; without multi-threading, the same crawl took 333 seconds. To get a better sense of how much multi-threading helps, try modifying self.page_num and self.thread_num in the code above.

I also ran an experiment with self.page_num set to 20, i.e. fetching a total of 20 pages:

  • 2 threads: 18.51 seconds
  • 5 threads: 7.49 seconds
  • 10 threads: 3.97 seconds
  • 20 threads: 2.11 seconds

A question

One final question for the reader to ponder: we already implemented a multi-threaded crawler in an earlier article. Why was the code so simple there, yet so much more complex here?

What's next

The next article in this multi-threaded crawler series will show how to use multi-threading when paging through results and crawling secondary pages.

Welcome to my Zhihu column

Column home: Programming in Python

Table of contents

Version description: Software and package version description