This article is participating in Python Theme Month; see the event link for details.

Preparation

What a crawler is: a web crawler (also known as a web spider or web robot, and more often called a web scutter in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Other, less common names include ant, automatic indexer, emulator, and worm. See the Baidu Encyclopedia entry for details.
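To make the definition concrete, here is a minimal sketch of that fetch-and-extract loop. The URL and the link-collecting rule are placeholders for illustration, not part of this article's project:

# A minimal crawler sketch: fetch one page, extract data by a fixed rule.
# The URL below is a hypothetical placeholder.
import urllib.request
from bs4 import BeautifulSoup

req = urllib.request.Request("https://example.com",
                             headers={"user-agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8")
soup = BeautifulSoup(html, "lxml")
for link in soup.select("a"):  # the "rule": collect every hyperlink on the page
    print(link.get("href"))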

With the advent of the big-data era, demand for data resources keeps growing, and crawlers are a good way to collect data automatically. Here is a recommended read: an annotated learning roadmap for Python web crawlers.

A few useful Python learning links:

1. Wang Hai's CSDN blog (请叫我汪海)

2. Liao Xuefeng's tutorial, plus its video version

3. A beginner's introduction to crawlers

4. The Scrapy crawler framework

Getting to work

Download Python

The installation is complete

Search for sample image-crawling code

As the saying goes, one generation plants the trees and the next enjoys the shade. Thanks to those who shared their code!

Install dependencies

Use pip install <package> to install the dependencies. Note that urllib is part of Python's standard library and does not need to be installed; the third-party packages this script actually needs are BeautifulSoup and the lxml parser:

pip install beautifulsoup4 lxml

If pip itself is out of date, upgrade it first:

python -m pip install --upgrade pip
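As a quick sanity check that the dependencies are importable and the lxml parser is wired up (a minimal sketch, not from the original article):

from bs4 import BeautifulSoup

# If this prints "hello", beautifulsoup4 and lxml are installed correctly
soup = BeautifulSoup("<p>hello</p>", "lxml")
print(soup.p.get_text())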

Run the project

python main.py

Run it, and the images are downloaded successfully.

Here are the code and the repository addresses.

Practice makes perfect!

GitHub code

Gitee code

# -*- coding:utf-8 -*-
import os
import random
import ssl
import time
import urllib.request

from bs4 import BeautifulSoup

# Request header configuration
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/86.0.4240.111 Safari/537.36")
# NOTE: the original values below were lost in formatting; only the trailing
# "/images" of BASE_DIR survived. Fill in the real site root before running.
BASE_URL = "https://example.com"            # target site root (placeholder)
BASE_DIR = os.getcwd() + os.sep + "images"  # local download directory


def start_work(serial_id):
    picture_dir = BASE_DIR + os.sep + serial_id
    if not os.path.exists(picture_dir):
        os.makedirs(picture_dir)  # makedirs also creates BASE_DIR if missing
    page_count = get_page_count(serial_id)
    print("%s has %d pictures" % (serial_id, page_count))
    get_image_for_serial(picture_dir, serial_id, page_count)


def get_page_count(serial_id):
    # Fetch the album page and parse the total page count out of it
    # (the def line was lost in the original formatting)
    header = {"user-agent": USER_AGENT}
    context = ssl._create_unverified_context()
    url = "%s/%s" % (BASE_URL, serial_id)
    req = urllib.request.Request(url, headers=header)
    resp = urllib.request.urlopen(req, context=context)
    str_content = resp.read().decode("utf-8")
    return __get_counts(str_content)


def __get_counts(html_content):
    # The second-to-last <span> in the pagination bar holds the page total
    page_count = 0
    soup = BeautifulSoup(html_content, 'lxml')
    data = soup.select("body > div.main > div.content > div.pagenavi > a > span")
    if data and len(data) >= 3:
        page_count = int(data[-2].get_text())
    return page_count


def get_image_url(html_content):
    # Extract the main image's src from a single gallery page
    soup = BeautifulSoup(html_content, 'lxml')
    data = soup.select("body > div.main > div.content > div.main-image > p > a > img")
    url = None
    try:
        url = data[0].get("src")
    except Exception as ex:
        print("exception occur: %s" % ex)
    return url


def get_all_image_urls(serial_id, page_count):
    # Collect the image URL of every page in the album
    url_list = list()
    header = {"user-agent": USER_AGENT}
    context = ssl._create_unverified_context()
    if page_count <= 1:
        return url_list
    for x in range(1, page_count + 1):
        url = "%s/%s/%s" % (BASE_URL, serial_id, x)
        req = urllib.request.Request(url, headers=header)
        resp = urllib.request.urlopen(req, context=context)
        str_content = resp.read().decode("utf-8")
        img_url = get_image_url(str_content)
        if img_url:
            url_list.append(img_url)
            print("page %d image address: %s" % (x, img_url))
        time.sleep(random.randint(1, 2))  # throttle requests
    return url_list


def get_image_for_serial(dir_path, serial_id, total_count):
    for i in range(1, total_count + 1):
        get_image_for_index(dir_path, serial_id, i)
        time.sleep(random.randint(1, 2))  # throttle requests


def get_image_for_index(dir_path, serial_id, page_index):
    header = {"user-agent": USER_AGENT}
    context = ssl._create_unverified_context()
    print("fetching page %d" % page_index)
    ref_url = "%s/%s/%s" % (BASE_URL, serial_id, page_index)
    req = urllib.request.Request(ref_url, headers=header)
    resp = urllib.request.urlopen(req, context=context)
    str_content = resp.read().decode("utf-8")
    img_url = get_image_url(str_content)
    if img_url:
        print("page %d image address: %s" % (page_index, img_url))
        print("try to save image %s" % img_url)
        save_img(dir_path, img_url, ref_url)


def save_imgs(dir_path, img_urls):
    # Body was truncated in the original; reconstructed to reuse save_img
    for img_addr in img_urls:
        save_img(dir_path, img_addr, BASE_URL)


def save_img(dir_path, img_url, ref_url):
    # The Referer header makes the request look like it came from the
    # gallery page, which is how the site gates hot-linked images
    header = {"user-agent": USER_AGENT, "Referer": ref_url}
    context = ssl._create_unverified_context()
    req = urllib.request.Request(img_url, headers=header)
    resp = urllib.request.urlopen(req, context=context)
    content = resp.read()
    with open(dir_path + os.sep + img_url.split('/')[-1], 'wb') as f:
        f.write(content)  # the write call was missing in the original
    print("save file to %s: %s" % (dir_path, img_url.split('/')[-1]))
    time.sleep(random.randint(1, 2))


if __name__ == "__main__":
    vol_list = ["204061"]
    for serial_id in vol_list:
        start_work(serial_id)
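The core of the script is the CSS-selector parsing: the pagination bar's second-to-last <span> holds the album's page total. Here is a small demonstration of that trick on made-up HTML that mimics the structure the selector expects (the markup is invented for illustration):

from bs4 import BeautifulSoup

# Invented HTML mimicking the pagination block the script's selector targets
html = """<body><div class="main"><div class="content"><div class="pagenavi">
<a><span>1</span></a><a><span>2</span></a>
<a><span>47</span></a><a><span>next page</span></a>
</div></div></div></body>"""
soup = BeautifulSoup(html, "lxml")
spans = soup.select("body > div.main > div.content > div.pagenavi > a > span")
print(int(spans[-2].get_text()))  # prints 47, the page total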

Summary

Did I learn Python? No! I just installed Python successfully and ran a single example, a small crawler exercise. All I know so far are some basic types and functions.