“This is the 14th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

The ultimate goal of this post is to collect the height, weight, date of birth, and blood type of 9,139 artists from around the world.

Analysis before crawling

The target website is www.ylq.com/star/list-a… . The data volume is 9,000+ artists spread across 153 list pages, which is not large as datasets go.
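Since the list pages follow a simple numeric pattern (the same URL template used by the crawler later in this post), all 153 page addresses can be generated up front; a minimal sketch:

# Build the 153 list-page URLs from the site's numeric pattern
page_urls = [
    f"http://www.ylq.com/star/list-all-------{page}.html"
    for page in range(1, 154)
]
print(len(page_urls), page_urls[0])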

The pages do not load their data through a separate interface; everything can be seen directly in the page source.

The list page contains each artist's inner-page address, profile picture, name, and other data that can be read at a glance. Clicking through to an inner page shows the detailed fields (height, weight, birthday, and so on). The crawling plan is therefore: collect all artists' inner-page links from the list pages, request each inner page to get the details, and archive the detailed data to finish the case.
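A quick way to confirm the data really is in the static HTML (rather than loaded by a separate interface) is to fetch one list page and look at the response text; a minimal check, assuming a plain desktop User-Agent is enough to get a normal response:

import requests

headers = {"user-agent": "Mozilla/5.0"}
r = requests.get("http://www.ylq.com/star/list-all-------1.html", headers=headers)
r.encoding = "utf-8"
# If artist names and inner-page links appear directly in r.text,
# the pages are server-rendered and can be parsed with regular expressions.
print(r.status_code, len(r.text))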

Coding time

Single-threaded crawl

All modules involved in this case can be imported up front:

import requests
import re
import json
import threading
import csv
import codecs
import fcntl  # this module is not used and can be ignored for now
import time

Getting and parsing the list page is relatively easy, so this part is explained first.

flag_page = 0

def get_list():
    global flag_page
    while flag_page < 154:
        time.sleep(1)
        flag_page += 1
        print(f"Crawling page {flag_page}")
        url = f"http://www.ylq.com/star/list-all-------{flag_page}.html"
        try:
            r = requests.get(url=url, headers=headers)
            # NOTE: the HTML tags inside the original pattern were lost when this
            # post was extracted; the pattern below only shows its general shape,
            # with the inner-page link as the first capture group and the name as
            # the second. Check the real markup and adjust it if it differs.
            pattern = re.compile(
                r'<li>[.\s]*<a href="(.*?)" target="_blank">[.\s]*(.*?)</a>')
            # Get all artists on the page
            famous = pattern.findall(r.text)
            print(famous)
        except Exception as e:
            print(e)
            continue

if __name__ == "__main__":
    get_list()

Notes on the code above:

• This routine will later use simple multithreading, so the global variable flag_page is declared ahead of time for the threads to share.
• Regular expressions take repeated practice. If one expression cannot capture everything in a single pass, you can match in several passes, as sketched below.
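For instance, with a made-up list item, the link and the name can be pulled out in two separate passes instead of one large expression:

import re

li = '<li><a href="/star/1.html" target="_blank">Artist A</a></li>'  # made-up sample
# First pass: the inner-page link
link = re.search(r'href="(.*?)"', li).group(1)
# Second pass: the display name between the anchor tags
name = re.search(r'target="_blank">(.*?)</a>', li).group(1)
print(link, name)  # /star/1.html Artist A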

Optimizing for multithreaded mode

Switching the current code to multithreading is very simple; only the function-call part of the code needs to change. Details are as follows:

if __name__ == "__main__":
    for i in range(1, 6):
        t = threading.Thread(target=get_list)
        t.setName(f't{i}')
        t.start()

This creates 5 threads, each named t{n}. Thread initialization and startup look like this:

t = threading.Thread(target=get_list)  # initialize
t.start()                              # start

At this point 5 threads run concurrently and the crawl speed improves significantly.
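One detail worth noting: flag_page is read and incremented by all 5 threads, so pages can in principle be skipped or fetched twice. A minimal sketch of guarding the counter with a lock (an addition to the original code, reusing the threading module and flag_page variable already defined above):

page_lock = threading.Lock()

def next_page():
    global flag_page
    # Claim the next page number atomically so two threads never take the same page
    with page_lock:
        flag_page += 1
        return flag_page

Inside get_list, each thread would then call next_page() instead of incrementing flag_page directly.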

Scraping the inner pages

After each thread has retrieved a list page, the inner pages can be analyzed: extract the inner-page links from the parsed data obtained above.

            # Get all artists on the page
            famous = pattern.findall(r.text)
            for user in famous:
                # inner-page address
                detail_url = user[0]
                # print(detail_url)
                data = get_detail(detail_url)
    

Then write the get_detail function, which mainly relies on regular expressions, as follows:

def get_detail(detail_url):
    r = requests.get(url=detail_url, headers=headers)
    r.encoding = "utf-8"
    html = r.text
    # Cut the string down to the block between the left and right columns
    start = html.find('<div class="sLeft">')
    end = html.find('<div class="sRight">')
    html = html[start:end]
    # Get the name and occupation.
    # NOTE: the literal HTML and labels around each named group were lost when this
    # post was extracted; the patterns below only show the general shape of the
    # originals, so check the page source and adjust them if they differ.
    name_type = re.search(
        r'<h1>(?P<name>.*?)</h1>.*?<p>(?P<type>.*?)</p>', html, re.S)
    # Region / city
    city = re.search(r'(<span>)?地区：(?P<city>.*?)(</span>)?<', html)
    high = re.search(r'(<span>)?身高：(?P<high>.*?)(</span>)?<', html)
    weight = re.search(r'(<span>)?体重：(?P<weight>.*?)(</span>)?<', html)
    birthday = re.search(r'(<span>)?生日：(?P<birthday>.*?)(</span>)?<', html)
    star = re.search(r'(<span>)?星座：(?P<star>.*?)(</span>)?<', html)
    blood = re.search(r'(<span>)?血型：(?P<blood>.*?)(</span>)?<', html)
    detail = {
        'name': name_type.group('name'),
        'type': name_type.group('type'),
        'city': city.group('city'),
        'high': high.group('high'),
        'weight': weight.group('weight'),
        'birthday': birthday.group('birthday'),
        'star': star.group('star'),
        'blood': blood.group('blood')
    }
    return detail
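To sanity-check get_detail on its own, it can be called with a single inner-page address; the URL below is a hypothetical placeholder, so substitute one copied from the list page:

if __name__ == "__main__":
    # Hypothetical inner-page URL; replace it with a real one taken from the list page
    test_url = "http://www.ylq.com/star/xxxx.html"
    print(get_detail(test_url))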

get_detail returns the matched data to the caller, and once the code runs we can print out the information we want. Finally, the data just needs to be saved to a local CSV file. The complete code follows:

import requests, re, json, threading, csv, codecs, time

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}

flag_page = 0

def get_detail(detail_url):
    r = requests.get(url=detail_url, headers=headers)
    r.encoding = "utf-8"
    html = r.text
    # Cut the string down to the detail block
    start = html.find('<div class="sLeft">')
    end = html.find('<div class="sRight">')
    html = html[start:end]
    # Get the name and occupation (patterns reconstructed, see the note above)
    name_type = re.search(
        r'<h1>(?P<name>.*?)</h1>.*?<p>(?P<type>.*?)</p>', html, re.S)
    city = re.search(r'(<span>)?地区：(?P<city>.*?)(</span>)?<', html)
    high = re.search(r'(<span>)?身高：(?P<high>.*?)(</span>)?<', html)
    weight = re.search(r'(<span>)?体重：(?P<weight>.*?)(</span>)?<', html)
    birthday = re.search(r'(<span>)?生日：(?P<birthday>.*?)(</span>)?<', html)
    star = re.search(r'(<span>)?星座：(?P<star>.*?)(</span>)?<', html)
    blood = re.search(r'(<span>)?血型：(?P<blood>.*?)(</span>)?<', html)
    detail = {
        'name': name_type.group('name'),
        'type': name_type.group('type'),
        'city': city.group('city'),
        'high': high.group('high'),
        'weight': weight.group('weight'),
        'birthday': birthday.group('birthday'),
        'star': star.group('star'),
        'blood': blood.group('blood')
    }
    return detail

def save_face():
    pass

def save(all_data):
    # fcntl.flock(f.fileno(), fcntl.LOCK_EX)
    with open('users.csv', 'a+', newline='', encoding='utf-8-sig') as f:
        fieldnames = ['name', 'type', 'city', 'high',
                      'weight', 'birthday', 'star', 'blood']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        for i in all_data:
            writer.writerow(i)

def get_list():
    global flag_page
    # name = threading.currentThread().name
    # print(f"current thread name = {name}")
    while flag_page < 154:
        time.sleep(1)
        flag_page += 1
        print(f"Crawling page {flag_page}")
        url = f"http://www.ylq.com/star/list-all-------{flag_page}.html"
        detail_url = ""
        try:
            r = requests.get(url=url, headers=headers)
            # Same list-page pattern as above (general shape only)
            pattern = re.compile(
                r'<li>[.\s]*<a href="(.*?)" target="_blank">[.\s]*(.*?)</a>')
            famous = pattern.findall(r.text)
            all_data = []
            for user in famous:
                detail_url = user[0]
                # print(detail_url)
                data = get_detail(detail_url)
                all_data.append(data)
            save(all_data)
        except Exception as e:
            print(e)
            print(f"{detail_url} went wrong")
            continue

if __name__ == "__main__":
    # Write the column headers once before the threads start appending
    with open('users.csv', 'w', newline='') as f:
        fieldnames = ['name', 'type', 'city', 'high',
                      'weight', 'birthday', 'star', 'blood']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
    for i in range(1, 6):
        t = threading.Thread(target=get_list)
        t.setName(f't{i}')
        t.start()

Data notes

The data crawled to local storage comes to about 500 KB. The amount is not large, and many fields come back as "unknown". If you want the data for analysis, running the code above is enough. The points worth special attention in this case are the general use of regular expressions, CSV storage (with column headers), and simple multithreaded crawling.
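One caveat on the CSV storage: save() is called from 5 threads at once, and the fcntl lock in the original code is commented out, so concurrent appends can interleave. A minimal, cross-platform alternative is to wrap the write in a threading.Lock; this is a sketch that extends the save() above, not part of the original code:

csv_lock = threading.Lock()

def save(all_data):
    # Only one thread appends to users.csv at a time
    with csv_lock:
        with open('users.csv', 'a+', newline='', encoding='utf-8-sig') as f:
            fieldnames = ['name', 'type', 'city', 'high',
                          'weight', 'birthday', 'star', 'blood']
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            for row in all_data:
                writer.writerow(row)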