This is the 29th day of my participation in the November Gwen Challenge. See the event details: "The Last Gwen Challenge of 2021".

Whose followers shall we crawl today? Whoever has the most followers gets crawled. So who has a lot of followers? Silent King II does.

Today we continue learning Python crawlers. Starting with this post, we begin a short (15-part) multi-threaded crawler series.

This first article collects the followers of @Silent King II, who has 270,000+ (27W+) followers, which is really enviable.

Target data source analysis

The data source to scrape this time is https://blog.csdn.net/qing_gee?type=sub&subType=fans, where the ID can be switched to whichever ID you want to collect, including your own.

Scrolling the page down automatically triggers requests to an API, https://blog.csdn.net/community/home-api/v1/get-fans-list?page=3&size=20&noMore=false&blogUsername=qing_gee, with the following parameters:

  • page: the page number, computed as the target's total follower count divided by 20;
  • size: records per page; the default is 20;
  • noMore: not used;
  • blogUsername: the blog username.

While testing the API I found that it sometimes returns abnormal data; adding a delay between requests greatly improves the stability of the data it returns.

{'code': 400, 'message': 'fail', 'data': None}
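To see what the API returns before adding any threads, a single page can be requested on its own. The sketch below is illustrative rather than part of the original crawler: the fetch_fans_page helper is my own naming, the cookie values are placeholders you must copy from the developer tools, and it simply retries once after a short random pause when the API answers with code 400.

import random
import time

import requests

# Hypothetical helper, not from the original post: fetch one page of the fan list
def fetch_fans_page(page, blog_username="qing_gee", size=20):
    url = (
        "https://blog.csdn.net/community/home-api/v1/get-fans-list"
        f"?page={page}&size={size}&noMore=false&blogUsername={blog_username}"
    )
    headers = {
        "user-agent": "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        # Placeholder values; copy your real cookie from the browser's developer tools
        "cookie": "UserName=your ID; UserInfo=your UserInfo; UserToken=your UserToken;",
        "referer": f"https://blog.csdn.net/{blog_username}?type=sub&subType=fans",
    }
    for _ in range(2):
        body = requests.get(url, headers=headers, timeout=5).json()
        if body["code"] != 400:
            return body["data"]["list"]
        # The API occasionally answers {'code': 400, 'message': 'fail'};
        # waiting a moment before retrying makes the responses more stable
        time.sleep(random.randint(1, 3))
    return []

if __name__ == '__main__':
    for user in fetch_fans_page(1):
        print(user["username"], user["nickname"])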

Normal interface data is returned as shown in the following figure:

Technical points used

This time the data is collected with Python multithreading, using the threading module for thread control. This column series starts from the simplest form of multithreading, as in this case: launching 5 (customizable) requests at a time.

The complete code is shown below. Please refer to the inline comments and the description at the end of the code.

import threading
from threading import Lock, Thread
import time
import os
import requests
import random


class MyThread(threading.Thread):
    def __init__(self, name):
        super(MyThread, self).__init__()
        self.name = name

    def run(self):
        global urls
        lock.acquire()
        if not urls:
            # No URLs left; avoid popping from an empty list
            lock.release()
            return
        one_url = urls.pop()
        print("Crawling:", one_url)
        lock.release()
        print("The thread waits for a random time")
        time.sleep(random.randint(1, 3))
        res = requests.get(one_url, headers=self.get_headers(), timeout=5)

        if res.json()["code"] != 400:
            data = res.json()["data"]["list"]
            for user in data:
                name = user['username']
                nickname = self.remove_character(user['nickname'])
                userAvatar = user['userAvatar']
                blogUrl = user['blogUrl']
                blogExpert = user['blogExpert']
                briefIntroduction = self.remove_character(
                    user['briefIntroduction'])

                with open('./qing_gee_data.csv', 'a+', encoding='utf-8') as f:
                    print(f'{name},{nickname},{userAvatar},{blogUrl},{blogExpert},{briefIntroduction}')
                    f.write(f"{name},{nickname},{userAvatar},{blogUrl},{blogExpert},{briefIntroduction}\n")
        else:
            print(res.json())
            print("Abnormal data", one_url)
            with open('./error.txt', 'a+', encoding='utf-8') as f:
                f.write(one_url + "\n")

    # Remove characters that would break the CSV layout
    def remove_character(self, origin_str):
        if origin_str is None:
            return ""
        origin_str = origin_str.replace('\n', ' ')
        # Swap half-width commas for full-width ones so they do not clash with the CSV separator
        origin_str = origin_str.replace(',', '，')
        return origin_str

    # Build a request header with a random UA
    def get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
        ]
        ua = random.choice(uas)
        # Note: the cookie below needs to be copied manually from the developer tools,
        # otherwise the captured data lacks the nickname and profile fields
        headers = {
            "user-agent": ua,
            "cookie": "UserName=your ID; UserInfo=your UserInfo; UserToken=your UserToken;",
            "referer": "https://blog.csdn.net/qing_gee?type=sub&subType=fans"
        }
        return headers


if __name__ == '__main__':
    lock = Lock()
    url_format = 'https://blog.csdn.net/community/home-api/v1/get-fans-list?page={}&size=20&noMore=false&blogUsername=qing_gee'
    urls = [url_format.format(i) for i in range(1, 13300)]
    l = []
    while len(urls) > 0:
        print(len(urls))
        for i in range(5):
            p = MyThread("t"+str(i))
            l.append(p)
            p.start()
        for p in l:
            p.join()

The code running result is as follows:

The code above uses multithreading, as well as thread locking. Simple multithreading code can be abstracted as follows.

Simple multithreaded code:

import threading
import time

def run(n):
    print('task', n)
    time.sleep(3)

if __name__ == '__main__':
    t1 = threading.Thread(target=run, args=('t1',))
    t2 = threading.Thread(target=run, args=('t2',))
    t1.start()
    t2.start()

In threading.Thread, target is the function to run and args holds the arguments passed to it; note that args must be a tuple.

The crawler code also uses a shared global variable. The simplified code below focuses on the lock = Lock() line and on calling lock.acquire() and lock.release() before and after touching the global variable. It also uses the thread join() method, which makes the main thread wait until the child threads have finished.

import threading
from threading import Lock, Thread
import time, os

def work():
    global urls
    lock.acquire()
    # Get a URL from the shared list
    one_url = urls.pop()
    lock.release()

    print("The URL I get is", one_url)


if __name__ == '__main__':
    lock = Lock()
    url_format = 'https://blog.csdn.net/community/home-api/v1/get-fans-list?page={}&size=20&noMore=false&blogUsername=qing_gee'
    # Concatenate the URLs into a globally shared variable
    urls = [url_format.format(i) for i in range(1, 13300)]
    l = []
    # Number of threads to start
    for i in range(3):
        p = Thread(target=work)
        l.append(p)
        p.start()
    for p in l:
        p.join()
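A note on the design: the manual Thread-plus-Lock bookkeeping above can also be handed to concurrent.futures.ThreadPoolExecutor from the standard library, which limits how many workers run at once and removes the need for a shared list and lock. This is an alternative sketch for comparison, not the approach used in the original crawler.

from concurrent.futures import ThreadPoolExecutor

def work(one_url):
    # Each task receives its own URL, so no shared list or lock is needed
    print("The URL I get is", one_url)

if __name__ == '__main__':
    url_format = 'https://blog.csdn.net/community/home-api/v1/get-fans-list?page={}&size=20&noMore=false&blogUsername=qing_gee'
    urls = [url_format.format(i) for i in range(1, 13300)]
    # At most 5 requests run at the same time, mirroring the batch size used above
    with ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(work, urls)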

Once this data has been collected, you can build a user portrait of an author's followers. That part will be covered in detail in a separate post later in this blog.

The data-cleaning part of the code still has room for optimization. Because 13,300 pages were requested, the crawl ultimately produced 260,000+ (26W+) records, and a quick query shows Dream Eraser among them.

There are at least 83 blog experts among the followers, and you can see that the personal profiles of blog experts are quite well written. Jiang Tao, the founder of CSDN, is among them as well.
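For reference, such a query can be run over the CSV the crawler produces. The snippet below is my own illustration, assuming the column order written above and that pandas is installed; it is not part of the original post.

import pandas as pd

# Column names follow the order written by the crawler above
columns = ["username", "nickname", "userAvatar", "blogUrl", "blogExpert", "briefIntroduction"]
df = pd.read_csv("./qing_gee_data.csv", names=columns)

# Drop users that were captured more than once across pages
df = df.drop_duplicates(subset="username")

print("Followers collected:", len(df))
print("Blog experts among them:", (df["blogExpert"].astype(str) == "True").sum())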

Collection time

Code download address: codechina.csdn.net/hihell/pyth… Could you give it a Star?

You've made it this far. No comment, no like, no bookmark?