This article took part in the "Digitalstar Project" creative-incentive challenge and won a creative gift package.
Dual-thread crawl of the cute-pictures site keaitupian.net
This post speeds up a Python crawler by implementing it with two threads. Along the way, there are some unexpected finds.
Crawl target analysis
Crawl target
- The cute-pictures website www.keaitupian.net/
- The site's picture categories are rich; you might want to grab them all (pretty girls, sexy beauties, and so on), but to keep the focus on the technique, I decided to crawl only the cartoon/anime category and leave the rest to you readers.
Python modules used
- requests
- re
- threading (newly added in this post, for running the crawl and save steps in parallel)
Key learning content
- The basic crawler routine;
- Crawling data when the total page count is unknown;
- A crawler with a fixed number of threads.
List page and detail page analysis
- We are crawling cartoon pictures, so the entry list page is https://www.keaitupian.net/dongman/. Clicking through several page numbers reveals the following URL pattern:
  - www.keaitupian.net/dongman/lis…
  - www.keaitupian.net/dongman/lis…
  - www.keaitupian.net/dongman/lis…
Since the total number of list pages cannot be read directly from the site, try probing with a large number: opening https://www.keaitupian.net/dongman/list-110.html shows that the page does not exist, as in the figure below.
After some manual testing, it turns out this category has 77 list pages.
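The large-number probe above can be automated with a binary search over page numbers. A minimal sketch, written against a hypothetical `page_exists` check (standing in for an HTTP request that detects the "page does not exist" response):

```python
def find_last_page(page_exists, upper_bound=110):
    # Binary search for the largest existing page number,
    # assuming pages 1..N exist and every page above N does not.
    lo, hi = 1, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if page_exists(mid):
            lo = mid          # mid exists: the last page is mid or later
        else:
            hi = mid - 1      # mid is missing: the last page is before mid
    return lo

# Fake check standing in for a live request; the real site reports 77 pages
print(find_last_page(lambda n: n <= 77))  # 77
```

Against a live site, `page_exists` would issue a `requests.get` and inspect the status code or page content.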
Click into any picture's detail page to inspect it. The detail page has its own pagination, and that pagination can cross between sets: after turning past 9/9, for example, you land on the next set of pictures. Data can therefore be fetched directly from the detail pages by following the "next" links.
Open the last set of photos on list page 77 and inspect its paging links: the final "next page" link is empty, i.e. the page cannot be turned any further.
Last page for reference: https://www.keaitupian.net/article/280-8.html#
Target-site analysis is complete; next, organize the overall logic and requirements.
Organize requirements logic
- Pick one detail-page address as the crawler's starting page;
- One thread saves the images;
- One thread collects the next-page addresses.
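The two threads above form a producer–consumer pair. Before the article's list-plus-lock implementation below, here is a minimal sketch of that relationship using the standard-library `queue.Queue`, which handles locking internally (placeholder URLs, not the article's final code):

```python
import queue
import threading

url_queue = queue.Queue()

def produce(start_urls):
    # Producer: push each discovered detail-page URL onto the queue
    for u in start_urls:
        url_queue.put(u)
    url_queue.put(None)  # sentinel: no more URLs

def consume(results):
    # Consumer: pop URLs until the sentinel arrives
    while True:
        u = url_queue.get()
        if u is None:
            break
        results.append(u)  # a real crawler would download the image here

results = []
p = threading.Thread(target=produce, args=(["page-1", "page-2"],))
c = threading.Thread(target=consume, args=(results,))
p.start(); c.start()
p.join(); c.join()
print(results)  # ['page-1', 'page-2']
```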
Coding time
Fetch the target request address
Per the requirements above, first implement the URL-looping thread. This thread repeatedly crawls "next page" URLs and appends them to a global list.
threading.Thread is used to create and start the threads, and a mutex lock protects the data shared between them.
Lock declaration:
mutex = threading.Lock()
Use of locks:
global urls
# locked
mutex.acquire()
urls.append(next_url)
mutex.release()
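An equivalent, less error-prone form uses the lock as a context manager, which guarantees the release even if `append` raises (same `urls` and `mutex` globals as in the article; `next_url` here is just a sample value):

```python
import threading

urls = []
mutex = threading.Lock()
next_url = "https://www.keaitupian.net/article/280-8.html"  # sample value

# "with" acquires the lock on entry and releases it on exit, even on error
with mutex:
    urls.append(next_url)

print(urls)
```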
The code that collects the URL addresses is as follows:
import requests
import re
import threading
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}

# global list of URLs shared between the threads
urls = []
mutex = threading.Lock()

# loop to get URLs
def get_image(start_url):
    global urls
    urls.append(start_url)
    next_url = start_url
    while next_url != "#":
        res = requests.get(url=next_url, headers=headers)
        if res is not None:
            html = res.text
            # NOTE: the original pattern was lost in formatting; this one,
            # meant to match the "next page" link, is an assumption to adjust
            pattern = re.compile(r'<a href="(.*?)" class="next">')
            match = pattern.search(html)
            if match:
                next_url = match.group(1)
                if next_url == "#":
                    # empty "next" link on the last page: stop
                    break
                if next_url.find('www.keaitupian') < 0:
                    next_url = f"https://www.keaitupian.net{next_url}"
                print(next_url)
                # lock
                mutex.acquire()
                urls.append(next_url)
                # release the lock
                mutex.release()
            else:
                break  # no "next" link found: stop instead of looping forever

if __name__ == '__main__':
    # URL-collecting thread
    gets = threading.Thread(target=get_image, args=(
        "https://www.keaitupian.net/article/202389.html",))
    gets.start()
Run the code and the console prints each captured target address in turn.
Extract the pictures from the target addresses
The last step: take the link addresses grabbed by the code above, extract each picture's address from them, and save the picture.
Saving images is also done in a thread, whose save_image function is as follows:
# image-saving thread
def save_image():
    global urls
    print(urls)
    while True:
        # lock
        mutex.acquire()
        if len(urls) > 0:
            # get the first item in the list
            img_url = urls[0]
            # delete the first item in the list
            del urls[0]
            # release the lock
            mutex.release()
            res = requests.get(url=img_url, headers=headers)
            if res is not None:
                html = res.text
                # NOTE: the original pattern was lost in formatting; this one,
                # meant to match the <img> tag on the detail page, is an assumption
                pattern = re.compile(r'<img src="(.*?)" alt=')
                img_match = pattern.search(html)
                if img_match:
                    img_data_url = img_match.group(1)
                    print("Grabbing picture:", img_data_url)
                    try:
                        res = requests.get(img_data_url)
                        with open(f"images/{time.time()}.png", "wb+") as f:
                            f.write(res.content)
                    except Exception as e:
                        print(e)
        else:
            # release the lock before looping again, or the other thread deadlocks
            mutex.release()
            print("Waiting... if the wait drags on, the crawl is done and you can stop it.")
Add a thread based on that function to the main function and start it as well:
if __name__ == '__main__':
    # URL-collecting thread
    gets = threading.Thread(target=get_image, args=(
        "https://www.keaitupian.net/article/202389.html",))
    gets.start()

    # image-saving thread
    save = threading.Thread(target=save_image)
    save.start()
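One of the learning goals was a crawler with a fixed number of threads. The two-thread version above can be extended by starting several saver threads that drain a shared queue. A minimal sketch with placeholder tasks instead of live downloads:

```python
import queue
import threading

THREAD_COUNT = 3  # fixed number of saver threads

def worker(task_queue, saved, lock):
    # Each worker drains tasks until the sentinel (None) arrives
    while True:
        url = task_queue.get()
        if url is None:
            task_queue.put(None)   # re-post the sentinel for the other workers
            break
        with lock:
            saved.append(url)      # a real worker would download the image here

task_queue = queue.Queue()
saved, lock = [], threading.Lock()
for u in [f"img-{i}" for i in range(10)]:
    task_queue.put(u)
task_queue.put(None)  # sentinel: no more work

threads = [threading.Thread(target=worker, args=(task_queue, saved, lock))
           for _ in range(THREAD_COUNT)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(saved))
```

The same pattern works for the URL-collecting side too; increase `THREAD_COUNT` as the site's rate limits allow.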