The following article is from Tencent Cloud by Li Zheng
Crawler principles
Know what it is, and know why it is so. Before writing a crawler we should first understand how crawlers work, so let's start with the basic workflow and the basic crawling strategies.
The basic flow of crawlers
The basic workflow of a web crawler is as follows (a minimal sketch in code follows the list):
- Provide one or more seed URLs
- The task queue starts processing the seed URLs
- For each URL, resolve DNS, download the corresponding web page, store the downloaded page, and move the URL into the crawled-URL queue
- Analyze the pages behind the crawled URLs, put the internal links they contain into the to-be-crawled URL queue, and repeat the cycle
- Parse the downloaded pages to obtain the required data
- Store the data in a database for persistence
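To make the flow concrete, here is a minimal sketch of that loop using the Requests and BeautifulSoup libraries introduced later in this article; the seed URL, the page limit, and the same-site link check are illustrative assumptions rather than part of any particular framework.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=10):
    to_crawl = deque(seed_urls)        # queue of URLs waiting to be crawled
    crawled = set()                    # URLs that have already been crawled
    results = []                       # parsed data; a real crawler would persist this to a database
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        if url in crawled:
            continue
        resp = requests.get(url, timeout=10)          # DNS resolution and download happen here
        crawled.add(url)                               # move the URL into the crawled set
        soup = BeautifulSoup(resp.text, 'lxml')        # parse the downloaded page
        results.append(soup.title.string if soup.title else url)
        for a in soup.find_all('a', href=True):        # internal links found in the page
            link = urljoin(url, a['href'])
            if urlparse(link).netloc == urlparse(url).netloc and link not in crawled:
                to_crawl.append(link)                  # put them into the queue to be crawled
    return results

print(crawl(['https://example.com/']))   # example.com is just a placeholder seed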
The basic strategy of crawlers
The queue of URLs waiting to be crawled is a very important part of a crawler system. The order in which this queue is processed matters, because it determines the order in which pages are fetched; the method that decides this ordering is called the crawling strategy.
Here are two common strategies:
- DFS (depth-first strategy): the crawler starts from a URL and follows links downward, link by link, finishing an entire branch before switching to another branch. The crawl order in this case is: A → B → C → D → E → F → G → H → I → J
- BFS (breadth-first strategy): the basic idea is to insert the links found in a newly downloaded page at the end of the to-be-crawled URL queue. In other words, the crawler first grabs all the pages linked from the initial page, then picks one of those linked pages and grabs all the pages linked from it, and so on. The crawl order in this case is: A → B → E → G → H → I → C → F → J → D. The two strategies differ only in which end of the pending queue the next URL is taken from, as the sketch after this list shows.
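Below is a minimal sketch of that difference on a made-up link graph (the graph is invented for illustration and is not the site structure behind the A…J example above):

from collections import deque

# Toy link graph: each page maps to the pages it links to (invented example).
links = {
    'A': ['B', 'C', 'D'],
    'B': ['E', 'F'],
    'C': ['H'],
    'E': ['G'],
}

def traverse(seed, depth_first=False):
    frontier = deque([seed])           # the pending URL queue
    seen = {seed}
    order = []
    while frontier:
        # DFS takes the most recently added URL, BFS the oldest one.
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

print(traverse('A'))                    # BFS: ['A', 'B', 'C', 'D', 'E', 'F', 'H', 'G']
print(traverse('A', depth_first=True))  # DFS (stack order): ['A', 'D', 'C', 'H', 'B', 'F', 'E', 'G']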
Crawler tools
As the saying goes, to do a good job, one must first sharpen one's tools.
Implementing a crawler in Python requires a number of helper tools, which are introduced below.
Anaconda
Anaconda (see the official website) is a Python distribution for scientific computing.
This section uses the latest official release at the time of writing (2018/1/10), Anaconda3-5.0.1, as an example.
Installation on Windows is straightforward, and it pairs well with PyCharm.
Because the official download servers are overseas, downloads can be slow; you can use the Tsinghua University mirror instead:
$ wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.0.1-Linux-x86_64.sh
$ bash Anaconda3-5.0.1-Linux-x86_64.sh
After the download finishes, run the script and follow the prompts to complete the installation step by step.
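Once the script finishes, a couple of standard commands can confirm that the environment is on the PATH (shown here purely as a sanity check):

$ conda --version        # prints the conda version if the install succeeded
$ python -V              # should report the Python version bundled with Anaconda
$ conda list             # lists the packages that ship with the distribution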
Requests
Requests (see the official documentation) is an improved take on the standard urllib: it packages up the common HTTP functionality and greatly simplifies usage.
Like most Python modules, it can be installed directly with the pip command:
$ pip install requests
Of course, since we installed the Anaconda distribution of Python, you can also install it with the conda command:
$ conda install requests
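A quick sketch of typical Requests usage; httpbin.org is just a public echo service used here for illustration:

import requests

# Send a GET request; the timeout avoids hanging forever on a bad connection.
resp = requests.get('https://httpbin.org/get', params={'q': 'python'}, timeout=10)

print(resp.status_code)      # HTTP status code, e.g. 200
print(resp.encoding)         # encoding guessed from the response headers
print(resp.text[:200])       # first 200 characters of the response body
print(resp.json()['args'])   # httpbin echoes the query parameters back as JSON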
lxml
lxml is an HTML/XML parsing package; here it is used as the parser that BeautifulSoup relies on to parse web pages.
$ pip install lxml
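Although it is mainly used here as a backend for BeautifulSoup, lxml can also parse documents on its own; a small sketch with an inline HTML string and an XPath query:

from lxml import etree

html = '<html><body><a href="/a">first</a><a href="/b">second</a></body></html>'
tree = etree.HTML(html)              # parse the HTML string into an element tree

# XPath: grab the text and the href attribute of every <a> element.
print(tree.xpath('//a/text()'))      # ['first', 'second']
print(tree.xpath('//a/@href'))       # ['/a', '/b']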
BeautifulSoup
BeautifulSoup (see the official documentation) is a Python library for extracting data from HTML and XML files. It lets you navigate, search, and modify the parse tree with your favorite parser. For beginners, the experience is far better than matching content by hand with regular expressions.
$ pip install beautifulsoup4
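A small sketch of the BeautifulSoup API on an inline HTML string, using lxml as the parser:

from bs4 import BeautifulSoup

html = '<html><body><p class="title">Hello</p><a href="/x">link one</a><a href="/y">link two</a></body></html>'
soup = BeautifulSoup(html, 'lxml')       # build the parse tree with the lxml parser

print(soup.p.text)                       # Hello
print(soup.find('p', class_='title'))    # the first <p class="title"> element
for a in soup.find_all('a'):             # iterate over every <a> element
    print(a.get('href'), a.text)         # /x link one, then /y link two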
Simple crawler test
Let's start by creating the first script; it is assumed here that a basic Python environment is already in place.
#!/usr/bin/env python
# coding=utf-8
import requests                     ## import the requests library
from bs4 import BeautifulSoup       ## import BeautifulSoup from bs4
import os                           ## imported in the original script; not used in this snippet

## request headers: pretend to be a normal browser
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1"}

## start URL
all_url = 'http://www.mzitu.com/all'

## send the request and receive the response
start_html = requests.get(all_url, headers=headers)

## print the page source
print(start_html.text)
This fetches the page that lists the titles and links of all the galleries on the site.
This is about as simple as a crawler gets; a sketch of the next step, actually extracting the titles and links, follows.
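As that next step, the downloaded HTML can be handed to BeautifulSoup to list the titles and links. This continues from the script above; the assumption that every gallery sits in a plain <a> tag is an illustrative guess that would need adjusting to the real page structure.

soup = BeautifulSoup(start_html.text, 'lxml')    # parse the downloaded page

## print the link and title of every <a> element on the page
## (assumption: the gallery list is made of plain <a> tags -- adjust to the real markup)
for a in soup.find_all('a', href=True):
    print(a['href'], a.get_text(strip=True))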