Hello, I’m Jiannan!

I crashed a fiction website while writing a tutorial, which genuinely startled me: every request came back with a 503 error, meaning the server was unavailable, all because I had written a crawler using coroutines.

Note: this article is for learning purposes only. Do not use it to harm any website, or you will bear the consequences!!

The server could not handle that much pressure and its resources became temporarily unavailable, so once I stopped the crawler, the novel site gradually returned to normal.

If you have read my blog, you will notice that I wrote only one summary each for multithreading, queues, and multiprocessing, yet this is already the fifth time I have written about coroutines. To be honest, coroutines involve too many pitfalls to cover casually; I needed time to work through the problems and optimize the code first.

As for multithreading, multiprocessing, queues, and similar topics, I rarely use them nowadays, which is why there is only one summary of each. I hope readers will forgive me.

Coroutines

A coroutine is essentially single-threaded: it exploits the time a program spends waiting by continuously switching between blocks of code. Switching between coroutine tasks is very cheap and makes use of time the thread would otherwise spend idle, so coroutines are usually the preferred choice in practice.
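To make this concrete, here is a minimal sketch (the coroutine names are made up for illustration) of two coroutines interleaving on a single thread while each of them waits:

import asyncio
import threading


async def worker(name, delay):
    # While this coroutine waits, the event loop switches to the other one
    print(f'{threading.current_thread().name}: {name} starts waiting')
    await asyncio.sleep(delay)
    print(f'{threading.current_thread().name}: {name} resumes')


async def demo():
    # Both coroutines run on the same main thread, interleaved by the event loop
    await asyncio.gather(worker('task-1', 1), worker('task-2', 1))


asyncio.run(demo())

Both tasks finish in roughly one second rather than two, because their waiting time overlaps.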

Introduction to asynchronous HTTP framework HTTPX

For those of you who are not familiar with coroutines, you might want to check out my previous articles for a quick overview. Everyone knows the Requests library, but its HTTP requests are synchronous; since HTTP requests are I/O-bound, they are ideal candidates for asynchronous requests using coroutines.

HTTPX is an open-source library that keeps essentially all the features of Requests while also supporting asynchronous HTTP requests.
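As a quick illustration of that Requests compatibility (httpbin is used here purely as an example endpoint), the synchronous API is essentially a drop-in replacement:

import httpx

# The synchronous API mirrors Requests almost one for one
resp = httpx.get('http://www.httpbin.org/get', params={'name': 'test'})
print(resp.status_code)     # 200
print(resp.json()['args'])  # {'name': 'test'}

The asynchronous side is what the rest of this article focuses on.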

Install HTTPX

pip install httpx

Practice

Let’s compare how long it takes to send a batch of HTTP requests with HTTPX synchronously versus asynchronously.

import httpx
import threading
import time


def send_requests(url, sign):
    # Send a synchronous GET request and print which thread handled it
    status_code = httpx.get(url).status_code
    print(f'send_requests:{threading.current_thread()}:{sign}: {status_code}')


start = time.time()
url = 'http://www.httpbin.org/get'
# Call the function 200 times, one request after another
[send_requests(url, sign=i) for i in range(200)]
end = time.time()
print('Runtime:', int(end - start))

The code is simple: you can see that send_requests accesses the target URL 200 times, one synchronous request after another.

Part of the running results are as follows:

send_requests:<_MainThread(MainThread, started 9552)>:191: 200
send_requests:<_MainThread(MainThread, started 9552)>:192: 200
send_requests:<_MainThread(MainThread, started 9552)>:193: 200
send_requests:<_MainThread(MainThread, started 9552)>:194: 200
send_requests:<_MainThread(MainThread, started 9552)>:195: 200
send_requests:<_MainThread(MainThread, started 9552)>:196: 200
send_requests:<_MainThread(MainThread, started 9552)>:197: 200
send_requests:<_MainThread(MainThread, started 9552)>:198: 200
send_requests:<_MainThread(MainThread, started 9552)>:199: 200
Runtime: 102

As you can see from the results, everything runs sequentially in the main thread because the requests are synchronous.

The program took 102 seconds.

Here it comes, here it comes, let’s try an asynchronous HTTP request and see what surprises it can bring.

import asyncio
import httpx
import threading
import time


client = httpx.AsyncClient()


async def async_main(url, sign):
    # While waiting for the response, the event loop switches to other coroutines
    response = await client.get(url)
    status_code = response.status_code
    print(f'{threading.current_thread()}:{sign}:{status_code}')


def main():
    loop = asyncio.get_event_loop()
    # Create 200 coroutines that all share the same AsyncClient
    tasks = [async_main(url='https://www.baidu.com', sign=i) for i in range(200)]
    async_start = time.time()
    loop.run_until_complete(asyncio.wait(tasks))
    async_end = time.time()
    loop.close()
    print('Runtime:', async_end - async_start)


if __name__ == '__main__':
    main()


Part of the running results are as follows:

<_MainThread(MainThread, started 13132)>:113:200
<_MainThread(MainThread, started 13132)>:51:200
<_MainThread(MainThread, started 13132)>:176:200
<_MainThread(MainThread, started 13132)>:174:200
<_MainThread(MainThread, started 13132)>:114:200
<_MainThread(MainThread, started 13132)>:49:200
<_MainThread(MainThread, started 13132)>:52:200
Runtime: 1.4899322986602783

Did you get a shock when you saw this running time? In just over a second, you visited Baidu 200 times. Fast enough to fly.

Limit concurrency

When asyncio is combined with HTTPX, the concurrency can become so high that the target server may be brought down, so we need to limit the number of concurrent requests.

Using a Semaphore

Asyncio actually comes with a class called Semaphore for limiting the number of coroutines. All we need to do is initialize it with the maximum number of coroutines allowed and then use it as an async context manager around the code we want to limit. The specific code is as follows:

import asyncio
import httpx
import time


async def send_requests(delay, sem):
    print(f'Requesting an interface with a delay of {delay} seconds')
    await asyncio.sleep(delay)
    async with sem:
        # Code inside this block runs with limited concurrency
        async with httpx.AsyncClient(timeout=20) as client:
            resp = await client.get('http://www.httpbin.org/get')
            print(resp)


async def main():
    start = time.time()
    delay_list = [3, 6, 1, 8, 2, 4, 5, 2, 7, 3, 9, 8]
    task_list = []
    sem = asyncio.Semaphore(3)  # at most 3 coroutines inside the "async with sem" block at once
    for delay in delay_list:
        task = asyncio.create_task(send_requests(delay, sem))
        task_list.append(task)
    await asyncio.gather(*task_list)
    end = time.time()
    print('Total time:', end - start)


asyncio.run(main())

Part of the running results are as follows:

<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
Total time: 9.540421485900879

But what if you want at most three coroutines to run per minute?

Just change the code to something like this:

async def send_requests(delay, sem):
    print(f'Requesting an interface with a delay of {delay} seconds')
    await asyncio.sleep(delay)
    async with sem:
        # Code inside this block runs with limited concurrency
        async with httpx.AsyncClient(timeout=20) as client:
            resp = await client.get('http://www.httpbin.org/get')
            print(resp)
        # Keep holding the semaphore for 60 seconds, so at most 3 requests start per minute
        await asyncio.sleep(60)

Conclusion

If you want to limit the number of concurrent coroutines, the easiest way is to use a Semaphore. Note, however, that it must be initialized before the coroutines are started and then passed into them, so that all the concurrent coroutines share the same Semaphore object.

Of course, different parts of a program may need different levels of concurrency, in which case you simply initialize multiple Semaphore objects.
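As a hedged sketch of that idea (the limits and helper names below are made up for illustration), downloading and parsing can each get their own limit:

import asyncio

import httpx

# Separate limits for separate parts of the program; the values are illustrative
download_sem = asyncio.Semaphore(20)  # at most 20 downloads at a time
parse_sem = asyncio.Semaphore(5)      # at most 5 parse jobs at a time


async def download(url):
    async with download_sem:
        async with httpx.AsyncClient(timeout=20) as client:
            return (await client.get(url)).text


async def parse(html):
    async with parse_sem:
        # Placeholder for whatever parsing you actually do
        return len(html)


async def main():
    html = await download('http://www.httpbin.org/get')
    print(await parse(html))


asyncio.run(main())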

Hands-on: crawling a Biquge novel site

Page analysis

First, on the novel's table-of-contents page, you can see that all the chapter links sit in the href attribute of the a tags inside dd tags.

The first step is to get all the chapter links.

The next step is to go into each chapter and capture the content.

As the figure above shows, the chapter text sits inside the div tag with id "content", and the screenshot reveals plenty of line breaks and stray whitespace in it, so the code needs to do some extra whitespace cleanup.
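For reference, a single line handles that cleanup: splitting on any whitespace and re-joining collapses the line breaks and repeated spaces (the sample string below is made up):

raw = '  First line of the chapter\r\n\r\n   with  stray   spaces  '
print(' '.join(raw.split()))  # First line of the chapter with stray spaces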

Get the source code of the web page

import asyncio

import httpx
from lxml import etree


async def get_home_page(url, sem):
    async with sem:
        async with httpx.AsyncClient(timeout=20) as client:
            resp = await client.get(url)
            resp.encoding = 'utf-8'
            html = resp.text
            return html

Get all chapter links

async def parse_home_page(sem):
    async with sem:
        url = 'https://www.biqugeu.net/13_13883/'
        html = etree.HTML(await get_home_page(url, sem))
        # The chapter links are relative, so prepend the site's domain
        content_urls = ['https://www.biqugeu.net/' + url for url in html.xpath('//dd/a/@href')]
        return content_urls

Notice that I do one extra thing here: concatenating the URLs. The hrefs we scrape are relative paths, not complete URLs, so we need to prepend the site's domain.
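If you would rather not splice strings by hand, urllib.parse.urljoin from the standard library does the same job; a small sketch (the href below is just an example, the real values come from the page):

from urllib.parse import urljoin

base = 'https://www.biqugeu.net/13_13883/'
href = '/13_13883/5027083.html'  # example relative href scraped from a dd/a tag
print(urljoin(base, href))       # https://www.biqugeu.net/13_13883/5027083.html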

Save the data

async def data_save(url, sem):
    async with sem:
        html = etree.HTML(await get_home_page(url, sem))
        title = html.xpath('//h1/text()')[0]
        contents = html.xpath('//div[@id="content"]/text()')
        print(f'Downloading {title}')
        for content in contents:
            # Collapse the extra whitespace and line breaks in each text node
            text = ' '.join(content.split())

            with open(f'./gold branches2/{title}.txt', 'a', encoding='utf-8') as f:
                f.write(text)
                f.write('\n')

The URLs obtained above are passed one by one into the data_save() function, which parses each page, extracts the text content, and saves it.

Create a coroutine task

async def main():
    sem = asyncio.Semaphore(20)  # cap the crawler at 20 concurrent requests
    urls = await parse_home_page(sem)
    tasks_list = []
    for url in urls:
        task = asyncio.create_task(data_save(url, sem))
        tasks_list.append(task)
    await asyncio.gather(*tasks_list)


asyncio.run(main())

Results

In less than a minute, the whole novel was downloaded. Just imagine how long an ordinary synchronous crawler would take.

At least 737 seconds!!

Finally

Is this really the last time I will write about coroutines? Definitely not. There is still the asynchronous HTTP library aiohttp, which I will share with you when I get to it.

That’s it for this share. If you have read this far, I hope you can give me a “like” and a “look again”, and if you can, please share it with more people so we can learn together.

Every word of this article is written from the heart, and your “thumbs up” lets me know that you are working hard alongside me.