“This is the 14th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

The ultimate goal of this post is to collect the height, weight, date of birth, and blood type of 9,139 artists from around the world.

Analysis before crawling

The target website is www.ylq.com/star/list-a… . The data volume is 9,000+ artists spread across 153 list pages, which is not large as datasets go.
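Since the list pages follow a simple numeric pattern (the same URL template used by the crawler later in this post), all 153 page addresses can be generated up front; a minimal sketch:

# Build the 153 list-page URLs from the site's numeric pattern
page_urls = [
    f"http://www.ylq.com/star/list-all-------{page}.html"
    for page in range(1, 154)
]
print(len(page_urls), page_urls[0])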

The pages do not load their data through a separate interface; everything can be seen directly in the page source.

The list page contains each artist's inner-page address, profile picture, name, and other data that can be read at a glance. Clicking through to an inner page shows the detailed fields (height, weight, birthday, and so on). The crawling plan is therefore: collect all artists' inner-page links from the list pages, request each inner page to get the details, and archive the detailed data to finish the case.
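A quick way to confirm the data really is in the static HTML (rather than loaded by a separate interface) is to fetch one list page and look at the response text; a minimal check, assuming a plain desktop User-Agent is enough to get a normal response:

import requests

headers = {"user-agent": "Mozilla/5.0"}
r = requests.get("http://www.ylq.com/star/list-all-------1.html", headers=headers)
r.encoding = "utf-8"
# If artist names and inner-page links appear directly in r.text,
# the pages are server-rendered and can be parsed with regular expressions.
print(r.status_code, len(r.text))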

Coding time

Single-threaded crawl

All modules involved in this case can be imported up front:

import requests
import re
import json
import threading
import csv
import codecs
import fcntl  # this module is not used and can be ignored for now
import time

Getting and parsing the list page is relatively easy, so this part is explained first.

flag_page = 0

def get_list():
    global flag_page
    while flag_page < 154:
        time.sleep(1)
        flag_page += 1
        print(f"Crawling page {flag_page}")
        url = f"http://www.ylq.com/star/list-all-------{flag_page}.html"
        try:
            r = requests.get(url=url, headers=headers)
            # NOTE: the HTML tags inside the original pattern were lost when this
            # post was extracted; the pattern below only shows its general shape,
            # with the inner-page link as the first capture group and the name as
            # the second. Check the real markup and adjust it if it differs.
            pattern = re.compile(
                r'<li>[.\s]*<a href="(.*?)" target="_blank">[.\s]*(.*?)</a>')
            # Get all artists on the page
            famous = pattern.findall(r.text)
            print(famous)
        except Exception as e:
            print(e)
            continue

if __name__ == "__main__":
    get_list()

Notes on the code above:

• This routine will later use simple multithreading, so the global variable flag_page is declared ahead of time for the threads to share.
• Regular expressions take repeated practice. If one expression cannot capture everything in a single pass, you can match in several passes, as sketched below.
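For instance, with a made-up list item, the link and the name can be pulled out in two separate passes instead of one large expression:

import re

li = '<li><a href="/star/1.html" target="_blank">Artist A</a></li>'  # made-up sample
# First pass: the inner-page link
link = re.search(r'href="(.*?)"', li).group(1)
# Second pass: the display name between the anchor tags
name = re.search(r'target="_blank">(.*?)</a>', li).group(1)
print(link, name)  # /star/1.html Artist A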

Optimizing for multithreaded mode

Switching the current code to multithreading is very simple; only the function-call part of the code needs to change. Details are as follows:

if __name__ == "__main__":
    for i in range(1, 6):
        t = threading.Thread(target=get_list)
        t.setName(f't{i}')
        t.start()

This creates 5 threads, each named t{n}. Thread initialization and startup look like this:

t = threading.Thread(target=get_list)  # initialize
t.start()                              # start

At this point 5 threads run concurrently and the crawl speed improves significantly.
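One detail worth noting: flag_page is read and incremented by all 5 threads, so pages can in principle be skipped or fetched twice. A minimal sketch of guarding the counter with a lock (an addition to the original code, reusing the threading module and flag_page variable already defined above):

page_lock = threading.Lock()

def next_page():
    global flag_page
    # Claim the next page number atomically so two threads never take the same page
    with page_lock:
        flag_page += 1
        return flag_page

Inside get_list, each thread would then call next_page() instead of incrementing flag_page directly.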

Scraping the inner pages

After each thread has retrieved a list page, the inner pages can be analyzed: extract the inner-page links from the parsed data obtained above.

            # Get all artists on the page
            famous = pattern.findall(r.text)
            for user in famous:
                # inner-page address
                detail_url = user[0]
                # print(detail_url)
                data = get_detail(detail_url)
    

Then write the get_detail function, which mainly relies on regular expressions, as follows:

def get_detail(detail_url):
    r = requests.get(url=detail_url, headers=headers)
    r.encoding = "utf-8"
    html = r.text
    # Cut the string down to the block between the left and right columns
    start = html.find('<div class="sLeft">')
    end = html.find('<div class="sRight">')
    html = html[start:end]
    # Get the name and occupation.
    # NOTE: the literal HTML and labels around each named group were lost when this
    # post was extracted; the patterns below only show the general shape of the
    # originals, so check the page source and adjust them if they differ.
    name_type = re.search(
        r'<h1>(?P<name>.*?)</h1>.*?<p>(?P<type>.*?)</p>', html, re.S)
    # Region / city
    city = re.search(r'(<span>)?地区：(?P<city>.*?)(</span>)?<', html)
    high = re.search(r'(<span>)?身高：(?P<high>.*?)(</span>)?<', html)
    weight = re.search(r'(<span>)?体重：(?P<weight>.*?)(</span>)?<', html)
    birthday = re.search(r'(<span>)?生日：(?P<birthday>.*?)(</span>)?<', html)
    star = re.search(r'(<span>)?星座：(?P<star>.*?)(</span>)?<', html)
    blood = re.search(r'(<span>)?血型：(?P<blood>.*?)(</span>)?<', html)
    detail = {
        'name': name_type.group('name'),
        'type': name_type.group('type'),
        'city': city.group('city'),
        'high': high.group('high'),
        'weight': weight.group('weight'),
        'birthday': birthday.group('birthday'),
        'star': star.group('star'),
        'blood': blood.group('blood')
    }
    return detail
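To sanity-check get_detail on its own, it can be called with a single inner-page address; the URL below is a hypothetical placeholder, so substitute one copied from the list page:

if __name__ == "__main__":
    # Hypothetical inner-page URL; replace it with a real one taken from the list page
    test_url = "http://www.ylq.com/star/xxxx.html"
    print(get_detail(test_url))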

get_detail returns the matched data to the caller, and once the code runs we can print out the information we want. Finally, the data just needs to be saved to a local CSV file. The complete code follows:

import requests, re, json, threading, csv, codecs, time

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}

flag_page = 0

def get_detail(detail_url):
    r = requests.get(url=detail_url, headers=headers)
    r.encoding = "utf-8"
    html = r.text
    # Cut the string down to the detail block
    start = html.find('<div class="sLeft">')
    end = html.find('<div class="sRight">')
    html = html[start:end]
    # Get the name and occupation (patterns reconstructed, see the note above)
    name_type = re.search(
        r'<h1>(?P<name>.*?)</h1>.*?<p>(?P<type>.*?)</p>', html, re.S)
    city = re.search(r'(<span>)?地区：(?P<city>.*?)(</span>)?<', html)
    high = re.search(r'(<span>)?身高：(?P<high>.*?)(</span>)?<', html)
    weight = re.search(r'(<span>)?体重：(?P<weight>.*?)(</span>)?<', html)
    birthday = re.search(r'(<span>)?生日：(?P<birthday>.*?)(</span>)?<', html)
    star = re.search(r'(<span>)?星座：(?P<star>.*?)(</span>)?<', html)
    blood = re.search(r'(<span>)?血型：(?P<blood>.*?)(</span>)?<', html)
    detail = {
        'name': name_type.group('name'),
        'type': name_type.group('type'),
        'city': city.group('city'),
        'high': high.group('high'),
        'weight': weight.group('weight'),
        'birthday': birthday.group('birthday'),
        'star': star.group('star'),
        'blood': blood.group('blood')
    }
    return detail

def save_face():
    pass

def save(all_data):
    # fcntl.flock(f.fileno(), fcntl.LOCK_EX)
    with open('users.csv', 'a+', newline='', encoding='utf-8-sig') as f:
        fieldnames = ['name', 'type', 'city', 'high',
                      'weight', 'birthday', 'star', 'blood']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        for i in all_data:
            writer.writerow(i)

def get_list():
    global flag_page
    # name = threading.currentThread().name
    # print(f"current thread name = {name}")
    while flag_page < 154:
        time.sleep(1)
        flag_page += 1
        print(f"Crawling page {flag_page}")
        url = f"http://www.ylq.com/star/list-all-------{flag_page}.html"
        detail_url = ""
        try:
            r = requests.get(url=url, headers=headers)
            # Same list-page pattern as above (general shape only)
            pattern = re.compile(
                r'<li>[.\s]*<a href="(.*?)" target="_blank">[.\s]*(.*?)</a>')
            famous = pattern.findall(r.text)
            all_data = []
            for user in famous:
                detail_url = user[0]
                # print(detail_url)
                data = get_detail(detail_url)
                all_data.append(data)
            save(all_data)
        except Exception as e:
            print(e)
            print(f"{detail_url} went wrong")
            continue

if __name__ == "__main__":
    # Write the column headers once before the threads start appending
    with open('users.csv', 'w', newline='') as f:
        fieldnames = ['name', 'type', 'city', 'high',
                      'weight', 'birthday', 'star', 'blood']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
    for i in range(1, 6):
        t = threading.Thread(target=get_list)
        t.setName(f't{i}')
        t.start()

Data notes

The data crawled to local storage comes to about 500 KB. The amount is not large, and many fields come back as "unknown". If you want the data for analysis, running the code above is enough. The points worth special attention in this case are the general use of regular expressions, CSV storage (with column headers), and simple multithreaded crawling.
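One caveat on the CSV storage: save() is called from 5 threads at once, and the fcntl lock in the original code is commented out, so concurrent appends can interleave. A minimal, cross-platform alternative is to wrap the write in a threading.Lock; this is a sketch that extends the save() above, not part of the original code:

csv_lock = threading.Lock()

def save(all_data):
    # Only one thread appends to users.csv at a time
    with csv_lock:
        with open('users.csv', 'a+', newline='', encoding='utf-8-sig') as f:
            fieldnames = ['name', 'type', 'city', 'high',
                          'weight', 'birthday', 'star', 'blood']
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            for row in all_data:
                writer.writerow(row)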