This is my fifth day of the Gwen Challenge

1. Why use multithreading

Multithreading works well for crawlers. There are a few things to know when using multithreading in Python:

1. Multithreading in Python is not like multithreading in Java. When the Python interpreter executes a task, Python threads are restricted by the global interpreter lock (GIL) to an execution model that allows only one thread to run Python bytecode at a time (a short sketch after this list shows the effect).
2. Python threads are better suited to I/O and other blocking operations (waiting for I/O, waiting to fetch data from a database, and so on) than to computationally intensive tasks that need to be spread across multiple processor cores. Fortunately, crawlers spend most of their time waiting on the network, so multithreading works well for writing crawlers.
3. This has little to do with multithreading, but note that Scrapy's concurrency is not implemented with multithreading; it is a Twisted application that uses asynchronous, non-blocking I/O.
4. When you want to improve execution efficiency for CPU-bound work in Python, most developers reach for multiprocessing, which gives true parallelism; you can of course also write multi-process crawlers.
5. What are the drawbacks of threads? When you write multithreaded code, you need to watch out for deadlocks, blocking, and communication between threads (for example, to avoid multiple threads performing the same task).
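
To make points 1 and 2 concrete, here is a quick sketch (my own illustration, with an arbitrary workload size) that times a CPU-bound function run twice sequentially and then in two threads; because of the GIL, the threaded version is usually no faster:

import time
from threading import Thread

def cpu_bound(n):
    # pure computation: the GIL prevents two threads from running this in parallel
    while n > 0:
        n -= 1

N = 10_000_000

start = time.time()
cpu_bound(N)
cpu_bound(N)
print('sequential: %.2fs' % (time.time() - start))

start = time.time()
t1 = Thread(target=cpu_bound, args=(N,))
t2 = Thread(target=cpu_bound, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print('two threads: %.2fs' % (time.time() - start))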

2. Using the threading module

We write multithreaded code with the threading module. You can achieve the same thing with from concurrent.futures import ThreadPoolExecutor, which is sketched briefly below.
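
For reference, a minimal sketch of what the ThreadPoolExecutor style looks like (my own example, with a made-up fetch function standing in for real work):

import time
from concurrent.futures import ThreadPoolExecutor

def fetch(n):
    time.sleep(0.1)   # stand-in for an I/O-bound task such as an HTTP request
    return n * 2

with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, range(10)))
print(results)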

Let’s start with a simple example of how to write thread code and how it can be used:

import time
from threading import Thread

def countdown(n):
	while n > 0:
		print('T-minus', n)
		n -= 1
		time.sleep(5)
	
t = Thread(target=countdown, args=(10,))
t.start()

countdown is an ordinary counting function; normally you would just call countdown(10). To run it in a thread, import Thread from the threading module and create a thread object with t = Thread(target=countdown, args=(10,)). Creating the Thread object does not run anything: the function only starts executing, with the arguments you passed in, when you call t.start(). That is a minimal example of running code in a thread.

You can query the state of a thread object to see if it is still executing:

if t.is_alive():
	print('Still running')
else:
	print('Completed, Go out ! ')

The Python interpreter keeps running until all threads have terminated. For long-running threads or background tasks that should not keep the program alive, consider using daemon (background) threads:

t = Thread(target=countdown, args=(10,), daemon=True)
t.start()

If you need to be able to terminate a thread, the thread must be written to poll for an exit condition at suitable points. You can wrap the thread's work in a class like this:

import time
from threading import Thread

class CountDownTask:
    def __init__(self):
        self._running = True      # polled by run(); cleared to request termination

    def terminate(self):
        self._running = False

    def run(self, n):
        while self._running and n > 0:
            print('T-minus', n)
            n -= 1
            time.sleep(5)

if __name__ == '__main__':
    c = CountDownTask()
    t = Thread(target=c.run, args=(10,))
    t.start()
    c.terminate()   # request termination; the loop exits on its next check
    t.join()
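
As a variation on the polling flag above, you could also use threading.Event for the same purpose; a minimal sketch (my own adaptation of the class above):

import time
from threading import Thread, Event

class CountDownTask:
    def __init__(self):
        self._stop_event = Event()

    def terminate(self):
        self._stop_event.set()          # signal the worker to stop

    def run(self, n):
        while not self._stop_event.is_set() and n > 0:
            print('T-minus', n)
            n -= 1
            time.sleep(5)

c = CountDownTask()
t = Thread(target=c.run, args=(10,))
t.start()
c.terminate()
t.join()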

3. A first taste of multithreading

The code above is single-threaded. Now let's look at multithreading and use it to write a multithreaded crawler. Before we actually write it, we need to decide how the threads will communicate with each other; here we use a Queue as the bridge between threads.

First, we create a Queue object shared by multiple threads; the threads add elements to the queue with put() and remove them with get().

from queue import Queue
from threading import Thread

def producer(out_q):
    while True:
        out_q.put(1)


def consumer(in_q):
    while True:
        data = in_q.get()
        
if __name__ == '__main__':
    q = Queue()
    t1 = Thread(target=consumer, args=(q, ))
    t2 = Thread(target=producer, args=(q, ))
    t1.start()
    t2.start()

The producer and consumer above are two different threads sharing a single queue, q. The producer puts data in and the consumer takes it out and consumes it, so you don't have to worry about two threads processing the same piece of data. It's worth noting that although queues are the most common mechanism for inter-thread communication, you can also build your own data structures and add the necessary locking and synchronization.
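
To illustrate that last point, here is a minimal sketch (my own example, unrelated to the crawler) of a hand-built thread-safe structure guarded by a Lock:

from threading import Thread, Lock

class ThreadSafeCounter:
    def __init__(self):
        self._value = 0
        self._lock = Lock()

    def increment(self):
        with self._lock:          # only one thread may modify the value at a time
            self._value += 1

    def value(self):
        with self._lock:
            return self._value

counter = ThreadSafeCounter()
threads = [Thread(target=lambda: [counter.increment() for _ in range(10000)]) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value())  # always 40000 because increments are serialized by the lock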

Now let's write a simple multithreaded crawler. The method below is a bit bloated and you normally wouldn't write it this way, but it works as a simple example:

import re
import time
import requests
import threading
from lxml import etree
from bs4 import BeautifulSoup
from queue import Queue
from threading import Thread


def run(in_q, out_q):
    headers = {
        'Accept': '',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Cookie': '',
        'DNT': '1',
        'Host': 'www.g.com',
        'Referer': '',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    while not in_q.empty():
        data = requests.get(url=in_q.get(), headers=headers)
        r = data.content
        content = str(r, encoding='utf-8', errors='ignore')
        soup = BeautifulSoup(content, 'html5lib')
        fixed_html = soup.prettify()
        html = etree.HTML(fixed_html)
        nums = html.xpath('//div[@class="col-md-1"]//text()')
        for num in nums:
            num = re.findall('[0-9]', ''.join(num))
            real_num = int(''.join(num))
            out_q.put(str(threading.current_thread().getName()) + '-' + str(real_num))
        in_q.task_done()  # tell the queue this task is finished


if __name__ == '__main__':
    start = time.time()
    queue = Queue()
    result_queue = Queue()
    for i in range(1, 1001):
        queue.put('http://www.g.com?page=' + str(i))
    print('Queue start size %d' % queue.qsize())

    for index in range(10):
        thread = Thread(target=run, args=(queue, result_queue, ))
        thread.daemon = True  # exit when the main thread exits
        thread.start()

    queue.join()  # block until the queue has been fully consumed
    end = time.time()
    print('Total time: %s' % (end - start))
    print('Queue end size %d' % queue.qsize())
    print('result_queue end size %d' % result_queue.qsize())


First, construct a task queue and a second queue to hold the results.

Build a task queue of 1000 page URLs and start ten threads running the run method to consume it. run takes two parameters: the task queue and the results queue. headers is the request header the crawler sends. in_q.empty() is a queue method that returns a boolean indicating whether the queue is empty. in_q.get() takes a value from the queue and removes it. out_q.put() adds an item to a queue, much like appending to a list. in_q.task_done() notifies the queue that a task has been completed. That completes a multithreaded crawler. Comparing the time consumed single-threaded with the time consumed using multiple threads, the improvement is substantial. Life is short; some things simply run faster.
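
To make the task_done()/join() relationship concrete, here is a minimal standalone sketch (my own illustration, separate from the crawler):

from queue import Queue
from threading import Thread

q = Queue()
for i in range(5):
    q.put(i)

def worker(in_q):
    while not in_q.empty():
        item = in_q.get()        # take an item and remove it from the queue
        print('processed', item)
        in_q.task_done()         # mark this item as fully handled

Thread(target=worker, args=(q,), daemon=True).start()
q.join()  # returns only after task_done() has been called once for every put()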

4. A producer-consumer crawler

Let's implement a simple producer-consumer crawler:

# Description: Python multithreaded - plain multithreaded - producer-consumer model

import re
import time
import requests
import threading
from lxml import etree
from bs4 import BeautifulSoup
from queue import Queue
from threading import Thread


def producer(in_q):  # producer
    ready_list = []
    while not in_q.full():
        for i in range(1, 1001):
            url = 'http://www.g.com/?page=' + str(i)
            if url not in ready_list:
                ready_list.append(url)
                in_q.put(url)   # blocks when the queue is full
            else:
                continue


def consumer(in_q, out_q):  # consumer
    headers = {
        'Accept': '',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Cookie': '',
        'DNT': '1',
        'Host': 'www.g.com',
        'Referer': 'http://www.g.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    while True:
        data = requests.get(url=in_q.get(), headers=headers)
        r = data.content
        content = str(r, encoding='utf-8', errors='ignore')
        soup = BeautifulSoup(content, 'html5lib')
        fixed_html = soup.prettify()
        html = etree.HTML(fixed_html)
        nums = html.xpath('//div[@class="col-md-1"]//text()')
        for num in nums:
            num = re.findall('[0-9]', ''.join(num))
            real_num = int(''.join(num))
            out_q.put(str(threading.current_thread().getName()) + '-' + str(real_num))
        in_q.task_done()  # notify the queue that this item has been consumed


if __name__ == '__main__':
    start = time.time()
    queue = Queue(maxsize=10)  # set the queue size to 10
    result_queue = Queue()
    print('Queue start size %d' % queue.qsize())

    producer_thread = Thread(target=producer, args=(queue,))
    producer_thread.daemon = True
    producer_thread.start()

    for index in range(10):
        consumer_thread = Thread(target=consumer, args=(queue, result_queue, ))
        consumer_thread.daemon = True
        consumer_thread.start()

    queue.join()
    end = time.time()
    print('Total time: %s' % (end - start))
    print('Queue end size %d' % queue.qsize())
    print('result_queue end size %d' % result_queue.qsize())



One thread performs production (the queue holds at most ten URLs at a time) and ten threads perform consumption. The code is very simple; if you've read the code above, this part shouldn't be hard to follow, so I won't explain it line by line.
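
The key to this model is that Queue(maxsize=10) throttles the producer: put() blocks whenever the queue is already full. A minimal sketch of that behaviour (my own illustration, with a deliberately tiny queue and a slow consumer):

from queue import Queue
from threading import Thread
import time

q = Queue(maxsize=2)

def slow_consumer(in_q):
    while True:
        item = in_q.get()
        time.sleep(0.5)        # pretend to do some work
        print('consumed', item)
        in_q.task_done()

Thread(target=slow_consumer, args=(q,), daemon=True).start()

for i in range(6):
    q.put(i)                   # blocks whenever the queue already holds 2 items
    print('produced', i)

q.join()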

5. Multi-process crawler

Multiprocessing in Python looks a lot like multithreading; the APIs are roughly, and in places exactly, the same. Why multiprocessing? When you want to speed up CPU-intensive tasks, multiple processes actually help. In this example I use a process pool to write the multi-process crawler. If you don't want a pool, writing multi-process code is much like the threading code above, except that you use from multiprocessing import Process, whose interface is similar to Thread, so I won't cover it separately here. Some notes on multiprocessing are in the code comments below.
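
Since Process is mentioned but not shown, here is a minimal sketch of using it directly (my own example with a trivial work function, not the crawler code):

import multiprocessing

def work(n, result_queue):
    result_queue.put(n * n)     # send the result back to the parent process

if __name__ == '__main__':
    result_queue = multiprocessing.Queue()
    processes = [multiprocessing.Process(target=work, args=(i, result_queue)) for i in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    while not result_queue.empty():
        print(result_queue.get())

The process-pool crawler itself follows: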


import re
import time
import requests
from lxml import etree
from bs4 import BeautifulSoup
import multiprocessing


def run(in_q, out_q):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;'
                  'q=0.8,application/signed-exchange;v=b3',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Cookie': '',
        'DNT': '1',
        'Host': 'www.g.com',
        'Referer': 'http://www.g.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    while not in_q.empty():
        data = requests.get(url=in_q.get(), headers=headers)
        r = data.content
        content = str(r, encoding='utf-8', errors='ignore')
        soup = BeautifulSoup(content, 'html5lib')
        fixed_html = soup.prettify()
        html = etree.HTML(fixed_html)
        nums = html.xpath('//div[@class="col-md-1"]//text()')
        for num in nums:
            num = re.findall('[0-9]', ''.join(num))
            real_num = int(''.join(num))
            out_q.put(str(real_num))
        in_q.task_done()
        return out_q


if __name__ == '__main__':
    start = time.time()
    queue = multiprocessing.Manager().Queue()
    result_queue = multiprocessing.Manager().Queue()

    for i in range(1, 1001):
        queue.put('http://www.g.com2?page=' + str(i))
    print('Queue start size %d' % queue.qsize())

    pool = multiprocessing.Pool(10)  # process pool; apply_async submits tasks without blocking
    for index in range(1000):
        '''
        The for loop does the following:
        (1) It iterates 1000 times, adding 1000 child tasks to the process pool
            (relative to the parent process, this part is blocking).
        (2) The pool runs 10 child processes at a time; as soon as one finishes,
            a new one is started.
        apply_async is asynchronous: the child processes run asynchronously with
        respect to the parent process's own work, whereas adding tasks to the pool
        inside the for loop is synchronous with the parent process.
        '''
        pool.apply_async(run, args=(queue, result_queue,))  # start a new task as workers free up
    pool.close()
    pool.join()

    queue.join()  # block until the queue has been fully consumed
    end = time.time()

    print('Total time: %s' % (end - start))
    print('Queue end size %d' % queue.qsize())
    print('result_queue end size %d' % result_queue.qsize())


Multi-process running results: