20 lines of code to become a succulent geek
Objective of this blog
Crawl target
- The site's succulent image channel; target data source: www.zhimengo.com/duoroutu?pa…
Frameworks used
- requests, re
- Some readers have asked: why not use other, more advanced crawler frameworks? Answer: this is a series of 120 crawler cases that progresses from shallow to deep, and we have only reached the fifth case so far.
Key learning content
- GET requests;
- Two-process crawling: one process crawls pages 1-25, the other crawls pages 26-55;
- Numbering and naming the images.
List page and detail page analysis
- The pagination data is clearly marked, so the total number of pages can be read directly;
- The detail-page links can be extracted directly.
Each detail page contains multiple images, and during crawling they need to be numbered and named, for example Baifeng succulent_1.png, Baifeng succulent_2.png.
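A minimal sketch of that naming scheme (the title and image list below are placeholders, not real data from the site):
title = "Baifeng succulent"            # placeholder title
img_urls = ["a.png", "b.png"]          # placeholder image URLs
for index, img_url in enumerate(img_urls, start=1):
    print(f"{title}_{index}.png")      # -> Baifeng succulent_1.png, Baifeng succulent_2.png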
Coding time
The first step is to get the detail-page addresses from the list page.
The solution is the same as in the previous case: fetch the page source and parse it with a regular expression. Before parsing, you can slice the string so that only the HTML of the target area remains.
import requests
import re

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"
}

def get_list():
    """ Get all detail-page links """
    all_list = []
    res = requests.get(
        "https://www.zhimengo.com/duoroutu?page=1", headers=headers)
    html = res.text
    start = '<ul class="list2">'
    end = '<ul class="pagination">'
    # Keep only the HTML between the list block and the pagination block
    html = html[html.find(start):html.find(end)]
    # NOTE: the original regular expression was lost when the article was
    # extracted; the pattern below is an assumption and must capture
    # (detail-page URL, title) pairs. Adjust it to the real list markup.
    pattern = re.compile(
        r'<a href="(https://www\.zhimengo\.com/duoroutu/\d+)"[^>]*>(.*?)</a>', re.S)
    all_list = pattern.findall(html)
    return all_list
Write the run function and call it.
def run():
    url_list = get_list()
    print(url_list)

if __name__ == "__main__":
    run()
After running the code, you get a list of detail-page links together with their titles; the titles are used later when saving the images.
The second step is to fetch all of the succulent images through the detail-page address. While coding this step, note that the HTML returned by the Python request is not identical to what the browser's developer tools show. The biggest impact of this difference is on how the regular expression is written.
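A small sketch (not part of the original code) that saves the raw response to a file, so the regular expression can be written against exactly what requests receives rather than the rendered DOM; the output filename is just an assumption:
import requests

headers = {"user-agent": "Mozilla/5.0"}
res = requests.get("https://www.zhimengo.com/duoroutu/24413", headers=headers)
with open("detail_raw.html", "w", encoding="utf-8") as f:
    f.write(res.text)   # inspect this file when writing the regular expression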
The code that captures the detail-page image addresses is shown below; a useful practice is to test the regular expression against one fixed address first.
def get_detail(title, url):
    res = requests.get(url=url, headers=headers)
    html = res.text
    print(html)
    # NOTE: the original regular expression was lost when the article was
    # extracted; the pattern below is an assumption and should capture the
    # src of every succulent image on the detail page. Adjust it to the
    # HTML that requests actually returns.
    pattern = re.compile(r'<img[^>]*src="(.*?)"')
    imgs = pattern.findall(html)
    for index, url in enumerate(imgs):
        print(title, index, url)
        # save_img(title, index, url)

def run():
    url_list = get_list()
    # print(url_list)
    # for url, title in url_list:
    get_detail("Pink vine succulent", "https://www.zhimengo.com/duoroutu/24413")
save_img is the function that saves the image; the code is as follows:
def save_img(title, index, url):
    try:
        img_res = requests.get(url, headers=headers)
        img_data = img_res.content
        print(f"Fetching: {url}")
        with open(f"images/{title}_{index}.png", "ab+") as f:
            f.write(img_data)
    except Exception as e:
        print(e)
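One detail the code above does not show: save_img writes into an images/ folder, which must already exist. A minimal sketch, assuming that folder name, to create it before crawling:
import os

# Create the output folder if it does not exist yet (assumed location "images/")
os.makedirs("images", exist_ok=True)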
Next, the code is reworked into a two-process crawler: one process crawls pages 1-25, the other crawls pages 26-55.
The main change is to the run function, as follows:
def run(start, end):
    wait_url = [
        f"https://www.zhimengo.com/duoroutu?page={i}" for i in range(int(start), int(end) + 1)]
    print(wait_url)
    url_list = []
    for item in wait_url:
        # get_list is assumed to be updated to accept the list-page URL
        # as a parameter instead of hard-coding page 1.
        ret = get_list(item)
        # print(len(ret))
        print(f"Captured {len(ret)} items")
        url_list.extend(ret)
    # print(len(url_list))
    for url, title in url_list:
        get_detail(title, url)

if __name__ == "__main__":
    start = input("Please enter the start page: ")
    end = input("Please enter the end page: ")
    run(start, end)
Run the program and test it; the speed depends on your network. To crawl with two processes, start the Python script twice and give each run a different page range.
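If you would rather launch both page ranges from a single script instead of starting it twice by hand, a sketch using Python's multiprocessing module looks like this; it only illustrates the same idea and assumes the run function defined above:
from multiprocessing import Process

if __name__ == "__main__":
    # One process for pages 1-25, another for pages 26-55 (same split as above)
    p1 = Process(target=run, args=(1, 25))
    p2 = Process(target=run, args=(26, 55))
    p1.start()
    p2.start()
    p1.join()
    p2.join()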