This article is participating in Python Theme Month; see the event link for details.
Preparation
What a crawler is: a web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically harvests information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm. See the Baidu Encyclopedia entry for details.
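Stripped to its essence, a crawler just downloads a page, parses it, and collects the links it finds. A minimal sketch (assuming BeautifulSoup and lxml are installed, as in the dependency step below, and using example.com as a placeholder URL):

import urllib.request
from bs4 import BeautifulSoup

def fetch_links(url):
    # Download the page and gather every hyperlink on it
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8")
    soup = BeautifulSoup(html, "lxml")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

print(fetch_links("https://example.com"))  # placeholder URL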
With the arrival of the big-data era, demand for data resources keeps growing, and crawlers are an excellent means of automated data collection. Recommended reading: an annotated Python web-crawler learning route.
A few Python learning links:
1. Wang Hai's CSDN blog ("Please call me Wang Hai")
2. Liao Xuefeng's Python tutorial, plus its video version
3. A beginner's introduction to crawlers
4. The Scrapy crawler framework
Getting to work
Download Python
Installation complete.
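If the installation worked, the interpreter and pip should both report a version (on some systems the commands are python3 and pip3):

python --version
pip --version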
Search for some image-crawling sample code
One generation plants the trees, the next enjoys the shade. Thank you, predecessors!
Install dependencies
Use pip install <package> to install the dependencies:
pip install beautifulsoup4 lxml
(Note: urllib is part of the Python standard library and needs no installation; the third-party dependencies here are beautifulsoup4 and lxml.)
python -m pip install --upgrade pip
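To confirm the dependencies are importable, an optional one-liner check:

python -c "import bs4, lxml; print('dependencies ok')"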
Run the project
python main.py
Run it, and the images download successfully.
The code and repository links are attached below.
Practice makes perfect!
GitHub code
Gitee code
# -*- coding:utf-8 -*-
import os
import random
import ssl
import time
import urllib.request

from bs4 import BeautifulSoup

# Request header configuration
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
# The original BASE_URL (the crawled site) was lost in formatting; fill in your target site.
BASE_URL = "https://example.com"
# Download directory; reconstructed from the "/images" fragment in the original.
BASE_DIR = os.getcwd() + os.sep + "images"


def start_work(serial_id):
    picture_dir = BASE_DIR + os.sep + serial_id
    if not os.path.exists(picture_dir):
        os.makedirs(picture_dir)
    page_count = get_page_count(serial_id)
    print("%s has %d pictures" % (serial_id, page_count))
    get_image_for_serial(picture_dir, serial_id, page_count)


def get_page_count(serial_id):
    # Fetch the gallery's first page and read the page count from it
    header = {"user-agent": USER_AGENT}
    context = ssl._create_unverified_context()  # skip certificate checks, as in the original
    url = "%s/%s" % (BASE_URL, serial_id)
    req = urllib.request.Request(url, headers=header)
    resp = urllib.request.urlopen(req, context=context)
    str_content = resp.read().decode("utf-8")
    return __get_counts(str_content)


def __get_counts(html_content):
    # The second-to-last <span> in the pagination bar holds the last page number
    page_count = 0
    soup = BeautifulSoup(html_content, 'lxml')
    data = soup.select("body > div.main > div.content > div.pagenavi > a > span")
    if data and len(data) >= 3:
        page_count = int(data[-2].get_text())
    return page_count


def get_image_url(html_content):
    # Extract the main image URL from a detail page
    soup = BeautifulSoup(html_content, 'lxml')
    data = soup.select("body > div.main > div.content > div.main-image > p > a > img")
    url = None
    try:
        url = data[0].get("src")
    except Exception as ex:
        print("exception occur: %s" % ex)
    return url


def get_all_image_urls(serial_id, page_count):
    # Collect every page's image URL (helper kept from the original; unused in the main flow)
    url_list = list()
    header = {"user-agent": USER_AGENT}
    context = ssl._create_unverified_context()
    if page_count <= 1:
        return url_list
    for x in range(1, page_count + 1):
        url = "%s/%s/%s" % (BASE_URL, serial_id, x)
        req = urllib.request.Request(url, headers=header)
        resp = urllib.request.urlopen(req, context=context)
        str_content = resp.read().decode("utf-8")
        img_url = get_image_url(str_content)
        if img_url:
            url_list.append(img_url)
            print("page %d image address: %s" % (x, img_url))
        time.sleep(random.randint(1, 2))  # be polite: pause between requests
    return url_list


def get_image_for_serial(dir_path, serial_id, total_count):
    for i in range(1, total_count + 1):
        get_image_for_index(dir_path, serial_id, i)
        time.sleep(random.randint(1, 2))


def get_image_for_index(dir_path, serial_id, page_index):
    header = {"user-agent": USER_AGENT}
    context = ssl._create_unverified_context()
    print("fetching page %s" % page_index)
    ref_url = "%s/%s/%s" % (BASE_URL, serial_id, page_index)
    req = urllib.request.Request(ref_url, headers=header)
    resp = urllib.request.urlopen(req, context=context)
    str_content = resp.read().decode("utf-8")
    img_url = get_image_url(str_content)
    if img_url:
        print("page %d image address: %s" % (page_index, img_url))
        print("try to save image %s" % img_url)
        save_img(dir_path, img_url, ref_url)


def save_imgs(dir_path, img_urls):
    # Body was truncated in the original; saving each URL with itself as referer
    for img_addr in img_urls:
        save_img(dir_path, img_addr, img_addr)


def save_img(dir_path, img_url, ref_url):
    # The Referer header makes the request look like it came from the gallery page
    header = {"user-agent": USER_AGENT, "Referer": ref_url}
    context = ssl._create_unverified_context()
    req = urllib.request.Request(img_url, headers=header)
    resp = urllib.request.urlopen(req, context=context)
    content = resp.read()
    with open(dir_path + os.sep + img_url.split('/')[-1], 'wb') as f:
        f.write(content)  # the write call was missing in the garbled original
        print("save file %s to %s" % (img_url.split('/')[-1], dir_path))
    time.sleep(random.randint(1, 2))


if __name__ == "__main__":
    vol_list = ["204061"]
    for serial_id in vol_list:
        start_work(serial_id)
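To see the parsing step in isolation, here is a small self-contained demo of the same page-count selector running against hand-written HTML (the markup is invented for illustration; the real site's structure may differ):

from bs4 import BeautifulSoup

# Invented markup mimicking the pagenavi structure the crawler expects
html = """
<body><div class="main"><div class="content"><div class="pagenavi">
<a href="/p/1"><span>1</span></a>
<a href="/p/2"><span>2</span></a>
<a href="/p/47"><span>47</span></a>
<a href="/p/2"><span>next</span></a>
</div></div></div></body>
"""
soup = BeautifulSoup(html, "lxml")
spans = soup.select("body > div.main > div.content > div.pagenavi > a > span")
print(int(spans[-2].get_text()))  # 47: the second-to-last span holds the last page number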
Summary
Have I learned Python? No! I have only installed Python successfully and run a single example, a small crawler exercise. All I know so far are some basic types and functions.