“This is the 14th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”
The goal of this post is to collect the height, weight, date of birth, and blood type of 9,139 artists from around the world.
Analysis before crawling
The target website is www.ylq.com/star/list-a… . The data set is 9,000+ artists spread across 153 list pages, so it is not particularly large.
The page does not load its data through an API; the data is embedded directly in the HTML, which you can verify by viewing the page source.
The list page, shown in the first screenshot, exposes each artist's inner-page address, profile picture, name, and other basic fields. Clicking through to an inner page reveals the detail fields shown in the second figure. The crawling plan is therefore: collect every artist's inner-page link from the list pages, then open each inner page and extract the details. Saving that detail data completes the case.
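To confirm that the list data really sits in the static HTML rather than behind an API, a quick check is to fetch one list page and look at the raw response. A minimal sketch, using the same URL pattern and user agent as the code later in this post:
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}

# Fetch the first list page; if the artist entries show up in this slice of
# HTML, no separate API call is needed
r = requests.get("http://www.ylq.com/star/list-all-------1.html", headers=headers)
r.encoding = "utf-8"
print(r.text[:2000])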
Coding time
Single-threaded crawl
All modules involved in this case can be imported first
import requests
import re
import json
import threading
import csv
import codecs
import fcntl  # this module is not used and can be ignored for now
import time
It is relatively easy to get and parse the page, so this part is explained first.
flag_page = 0


def get_list():
    global flag_page
    while flag_page < 154:
        time.sleep(1)
        flag_page += 1
        print(f"Crawling page {flag_page}")
        url = f"http://www.ylq.com/star/list-all-------{flag_page}.html"
        try:
            r = requests.get(url=url, headers=headers)
            # NOTE: the HTML tags inside this pattern were lost when the article
            # was rendered; restore them from the list-page source. The first
            # capture group must be the artist inner-page URL.
            pattern = re.compile(r'[.\s]*[.\s]*(.*?)')
            # Get all artists on the page
            famous = pattern.findall(r.text)
            print(famous)
        except Exception as e:
            print(e)
            continue


if __name__ == "__main__":
    get_list()
Notes on the code above:

- This routine will later use simple multithreading, so the global variable flag_page is declared ahead of time for the threads to share.
- Regular expressions take repeated practice. If one expression cannot match everything in a single pass, match in several passes; a short named-group sketch follows this list.
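Since the detail patterns later in this post rely on named capture groups, here is a tiny self-contained sketch of that technique. The HTML fragment and tag names are invented purely for illustration; the real patterns must be built against the target page source.
import re

# Hypothetical HTML fragment, only for demonstrating named groups
html = '<h1>Some Artist</h1><span class="job">Singer</span>'

m = re.search(r'<h1>(?P<name>.*?)</h1><span class="job">(?P<type>.*?)</span>', html)
if m:
    print(m.group('name'))   # Some Artist
    print(m.group('type'))   # Singer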
Optimizing for multithreading
Changing from the current code to multithreading is very simple; you just need to change the function call part of the code. Details are as follows:
if __name__ == "__main__":
    for i in range(1, 6):
        t = threading.Thread(target=get_list)
        t.setName(f't{i}')
        t.start()
This creates 5 threads, named t1 through t5. Initializing and starting a thread boils down to:
t = threading.Thread(target=get_list) # initialization
t.start() # start
At this point the code runs 5 threads concurrently and the crawl speed improves considerably. Note that every thread reads and increments the shared flag_page counter; a sketch of guarding it with a lock follows below.
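One caveat: flag_page += 1 in get_list is not atomic, so two threads can occasionally pick the same page number. A minimal sketch of guarding the counter with threading.Lock (the next_page helper and the lock are my own additions, not part of the original code):
import threading

flag_page = 0                  # shared page counter from the original code
page_lock = threading.Lock()   # added: guards the shared counter


def next_page():
    """Return the next page number to crawl, or None once all 153 pages are taken."""
    global flag_page
    with page_lock:
        if flag_page >= 153:
            return None
        flag_page += 1
        return flag_page
Each thread's loop would then call next_page() and stop when it returns None, instead of reading and incrementing flag_page directly.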
Scraping the inner pages
Once a thread has fetched and parsed a list page, it can extract the inner-page links from the matched data and crawl each inner page.
# Get all artists on the page
famous = pattern.findall(r.text)
for user in famous:
    # inner-page address
    detail_url = user[0]
    # print(detail_url)
    data = get_detail(detail_url)
Copy the code
Then extend the get_detail function, which mainly uses regular expressions, as follows:
def get_detail(detail_url):
    r = requests.get(url=detail_url, headers=headers)
    r.encoding = "utf-8"
    html = r.text
    # Slice out the left-hand info block to shrink the search space
    start = html.find('<div class="sLeft">')
    end = html.find('<div class="sRight">')
    html = html[start:end]
    # NOTE: the HTML tags inside the patterns below were lost when the article
    # was rendered; restore them from the detail-page source. Only the named
    # capture groups (?P<...>) are shown intact.
    # Get the name and occupation
    name_type = re.search(r'(?P<name>.*?)(?P<type>.*?)', html)
    # Get the region
    city = re.search(r'(?P<city>.*?)', html)
    high = re.search(r'(?P<high>.*?)', html)
    weight = re.search(r'(?P<weight>.*?)', html)
    birthday = re.search(r'(?P<birthday>.*?)', html)
    star = re.search(r'(?P<star>.*?)', html)
    blood = re.search(r'(?P<blood>.*?)', html)
    detail = {
        'name': name_type.group('name'),
        'type': name_type.group('type'),
        'city': city.group('city'),
        'high': high.group('high'),
        'weight': weight.group('weight'),
        'birthday': birthday.group('birthday'),
        'star': star.group('star'),
        'blood': blood.group('blood')
    }
    return detail
The function above returns the matched data to the caller, and once the code runs we can print out the information we want. Finally, the data just needs to be saved to a local CSV file. The complete code is as follows:
import requests, re, json, threading, csv, codecs, time

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}
flag_page = 0


def get_detail(detail_url):
    r = requests.get(url=detail_url, headers=headers)
    r.encoding = "utf-8"
    html = r.text
    # Slice out the left-hand info block to shrink the search space
    start = html.find('<div class="sLeft">')
    end = html.find('<div class="sRight">')
    html = html[start:end]
    # NOTE: the HTML tags inside the patterns below were lost when the article
    # was rendered; restore them from the detail-page source. Only the named
    # capture groups (?P<...>) are shown intact.
    # Get the name and occupation
    name_type = re.search(r'(?P<name>.*?)(?P<type>.*?)', html)
    # Get the region
    city = re.search(r'(?P<city>.*?)', html)
    high = re.search(r'(?P<high>.*?)', html)
    weight = re.search(r'(?P<weight>.*?)', html)
    birthday = re.search(r'(?P<birthday>.*?)', html)
    star = re.search(r'(?P<star>.*?)', html)
    blood = re.search(r'(?P<blood>.*?)', html)
    detail = {
        'name': name_type.group('name'),
        'type': name_type.group('type'),
        'city': city.group('city'),
        'high': high.group('high'),
        'weight': weight.group('weight'),
        'birthday': birthday.group('birthday'),
        'star': star.group('star'),
        'blood': blood.group('blood')
    }
    return detail


def save_face():
    pass


def save(all_data):
    # fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # not used in this version
    with open('users.csv', 'a+', newline='', encoding='utf-8-sig') as f:
        fieldnames = ['name', 'type', 'city', 'high', 'weight', 'birthday', 'star', 'blood']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        for i in all_data:
            writer.writerow(i)


def get_list():
    global flag_page
    # name = threading.currentThread().name
    # print(f"current thread name = {name}")
    while flag_page < 154:
        time.sleep(1)
        flag_page += 1
        print(f"Crawling page {flag_page}")
        url = f"http://www.ylq.com/star/list-all-------{flag_page}.html"
        try:
            r = requests.get(url=url, headers=headers)
            # NOTE: the HTML tags inside this pattern were lost when the article
            # was rendered; restore them from the list-page source. The first
            # capture group must be the artist inner-page URL.
            pattern = re.compile(r'[.\s]*[.\s]*(.*?)')
            famous = pattern.findall(r.text)
            all_data = []
            for user in famous:
                detail_url = user[0]
                # print(detail_url)
                data = get_detail(detail_url)
                all_data.append(data)
            save(all_data)
        except Exception as e:
            print(e)
            print(f"{detail_url} something went wrong")
            continue


if __name__ == "__main__":
    with open('users.csv', 'w', newline='', encoding='utf-8-sig') as f:
        fieldnames = ['name', 'type', 'city', 'high', 'weight', 'birthday', 'star', 'blood']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
    for i in range(1, 6):
        t = threading.Thread(target=get_list)
        t.setName(f't{i}')
        t.start()
Showing the data
The crawled data comes to roughly 500 KB on disk.
The data set is small and many fields come back unknown. If you want the data for your own analysis, simply running the code above is enough. The points worth special attention in this case are the general-purpose use of regular expressions, CSV storage with column headers, and simple multithreading.
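Once users.csv exists, it can be read back with csv.DictReader for a quick sanity check. A minimal sketch, assuming the file name and column names produced by the code above:
import csv

# Read the crawled file back and count how many rows actually have a blood type
with open('users.csv', encoding='utf-8-sig', newline='') as f:
    rows = list(csv.DictReader(f))

print(f"total rows: {len(rows)}")
print(f"rows with a blood type: {sum(1 for r in rows if r.get('blood'))}")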