I plan to record what I learn on my blog as a summary of my studies, and I hope to share my learning process with you. Without further ado, let’s get started
1. A basic introduction to web crawlers
What is a web crawler? Here is the definition from Baidu Encyclopedia:
A web crawler is a program or script that automatically captures information on the World Wide Web according to certain rules
People often describe web crawlers with a metaphor: if the Internet is a net, a web crawler is a little bug crawling around on it. It finds web pages through their link addresses and uses a specific search algorithm to decide its route. Usually it starts from one page of a website, reads the content of that page, finds the other links on it, then follows those links to the next pages, and so on, until every page of the site has been crawled
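The crawl described above amounts to a graph traversal. Here is a minimal sketch in Python, assuming a made-up in-memory site (the `SITE` dictionary and its page names are hypothetical) in place of real HTTP requests, and breadth-first order as one possible choice of search algorithm:

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/blog/post-1"],
}


def crawl(start):
    """Visit every page reachable from `start`, breadth first."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        # "Reading the page" is just a dictionary lookup in this sketch.
        for link in SITE.get(page, []):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/"))
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other.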
2. Basic principles of web crawlers
The following picture shows the basic workflow of a general web crawler. Let’s go through what it means in detail
1. Send the request
The first step in a crawler is to send a request to the starting URL to get the response it returns
It is worth noting that sending a request is essentially the process of sending a request message
A request message consists of four parts: request line, request headers, blank line and request body
However, when sending a request with one of Python’s network request libraries, you can usually focus on specific parts rather than the complete request message. We highlight those parts in bold below
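As an illustration of focusing on only a few parts, here is a small sketch using `urllib.request` from the standard library (one possible choice of library, not necessarily the one used later). The URL `https://example.com` and the User-Agent string are placeholders, and nothing is actually sent over the network:

```python
from urllib.request import Request

# Placeholder URL and User-Agent; building a Request does not send it.
req = Request(
    "https://example.com/search?q=crawler",
    headers={"User-Agent": "my-crawler/0.1"},
)

print(req.get_method())    # request method, chosen by the library
print(req.full_url)        # request URL
print(req.header_items())  # request headers set so far
```

We only supplied a URL and one header; the library fills in the rest of the request message when the request is sent.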
(1) Request line
A request line consists of three fields: request method, request URL, and HTTP protocol version. The fields are separated by spaces
① Request method: The request method specifies the operation to perform on the target resource. The common methods are GET and POST
- GET: Requests data from the specified resource. The query string is sent as part of the URL
- POST: Submits data to be processed to the specified resource. The data is sent in the request body
② Request URL: The request URL is the Uniform Resource Locator (URL) of the target resource
③ HTTP protocol version: The HTTP protocol is the standard that both sides of the communication follow for the communication process and content format. The version field states which version is in use, for example HTTP/1.1
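To make the three fields concrete, here is a rough sketch that builds a request line and splits it again. The helper names `build_request_line` and `parse_request_line`, the path and the query parameters are all made up for illustration:

```python
from urllib.parse import urlencode


def build_request_line(method, path, params=None, version="HTTP/1.1"):
    """Join method, request target and protocol version with single spaces."""
    target = path
    if params:
        # For GET, the query string travels in the URL itself.
        target = f"{path}?{urlencode(params)}"
    return f"{method} {target} {version}"


def parse_request_line(line):
    """Split a request line back into its three fields."""
    method, target, version = line.split(" ", 2)
    return method, target, version


line = build_request_line("GET", "/search", {"q": "web crawler", "page": 2})
print(line)
print(parse_request_line(line))
```

Note that `urlencode` encodes the space in the query value, so the request line still contains exactly two separating spaces.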
(2) Request header
The request headers carry the configuration information of the request. Common request headers are listed below (this list is updated over time)
- User-Agent: Contains information about the client that makes the request. Setting User-Agent is often used to get past anti-crawler checks
- Cookie: Contains data that the server stored on the client in earlier responses. Setting cookies is often used to simulate a logged-in session
- Referer: Indicates the page the request came from. Servers use it to prevent hotlinking and malicious requests
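The three headers above can be collected in a plain dictionary, which is the form most Python request libraries accept. This is only a sketch: `build_headers` is a hypothetical helper, and the User-Agent string, cookie values and referring URL are invented:

```python
def build_headers(user_agent, cookies=None, referer=None):
    """Return a dict of request headers for a crawler request."""
    headers = {"User-Agent": user_agent}
    if cookies:
        # All cookie pairs go into one Cookie header, joined with "; ".
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    if referer:
        headers["Referer"] = referer
    return headers


headers = build_headers(
    "Mozilla/5.0 (compatible; my-crawler/0.1)",
    cookies={"sessionid": "abc123", "lang": "en"},
    referer="https://example.com/",
)
for name, value in headers.items():
    print(f"{name}: {value}")
```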
(3) Blank line
A blank line marks the end of the request header
(4) Request body
The request body contains different contents depending on the request method
If the request method is GET, the request body is usually empty. If the request method is POST, it holds the data to be submitted (for example, form data)
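Putting the four parts together, a complete request message could be assembled by hand roughly as follows. This is a simplified sketch rather than what any particular library does internally: `build_request_message` is a hypothetical helper, the host, paths and form fields are made up, and the body is assumed to be URL-encoded form data:

```python
from urllib.parse import urlencode

CRLF = "\r\n"


def build_request_message(method, target, headers, form=None):
    """Assemble request line, headers, blank line and body into one string."""
    body = urlencode(form) if form else ""
    headers = dict(headers)  # copy so the caller's dict is not modified
    if body:
        headers["Content-Type"] = "application/x-www-form-urlencoded"
        headers["Content-Length"] = str(len(body.encode("ascii")))
    lines = [f"{method} {target} HTTP/1.1"]                           # request line
    lines += [f"{name}: {value}" for name, value in headers.items()]  # headers
    # The empty line (two CRLFs in a row) ends the headers; the body follows.
    return CRLF.join(lines) + CRLF + CRLF + body


get_msg = build_request_message("GET", "/index.html", {"Host": "example.com"})
post_msg = build_request_message(
    "POST", "/login", {"Host": "example.com"}, {"user": "alice", "pw": "secret"}
)
print(post_msg)
```

For the GET message, nothing follows the blank line; for the POST message, the form data does.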
2. Get the response
The second step of the crawler is to retrieve the response returned by a particular URL to extract the data contained therein
Similarly, a response actually refers to the complete response message, which consists of four parts: response line, response header, blank line and response body
(1) Response line
The response line consists of the HTTP protocol version, the status code, and its description
① HTTP protocol version: HTTP protocol refers to the standard that both sides of communication comply with in terms of communication process and content format
② Status code and its description
- 100 to 199: Informational. The server has received the request and the requester should continue the operation
- 200 to 299: Success. The request was received and processed successfully
- 300 to 399: Redirection. Further action is required to complete the request
- 400 to 499: Client error. The request contains a syntax error or cannot be completed
- 500 to 599: Server error. An error occurred while the server was processing the request
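Since the class of a status code is decided by its first digit, a crawler can sort responses with a few lines of code. A minimal sketch, where `classify_status` is a made-up helper name and the description text comes from the standard library's `http.HTTPStatus`:

```python
from http import HTTPStatus

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def classify_status(code):
    """Map a status code to its class using the hundreds digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP status code: {code}")
    return CLASSES[code // 100]


for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard description, e.g. "Not Found".
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```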
(2) Response header
Response headers describe basic information about the server and the returned data. Common response headers are listed below (this list is updated over time)
- Set-Cookie: Sets a cookie in the browser. When the browser later accesses a URL that matches the cookie’s scope, the cookie is automatically added to the Cookie request header
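A crawler has to do by hand what the browser does automatically here. The sketch below parses a Set-Cookie value with the standard library's `http.cookies.SimpleCookie` and turns it into a Cookie header for the next request; the cookie name and value are invented, and scope checks (domain, path, expiry) are left out:

```python
from http.cookies import SimpleCookie

# Made-up value of a Set-Cookie response header.
set_cookie_value = "sessionid=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_value)

# Attributes such as Path belong to the cookie, not to the next request;
# only name=value pairs are sent back in the Cookie request header.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

print(jar["sessionid"].value)
print(jar["sessionid"]["path"])
print("Cookie:", cookie_header)
```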
(3) Blank line
A blank line marks the end of the response header
(4) Response body
The response body is the data returned by the website, which we need to analyze in the next step
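To see the four parts of a response message side by side, here is a simplified sketch that splits a raw response into them. The `RAW` bytes are a hand-written example, `parse_response_message` is a hypothetical helper, and real libraries handle many more cases (repeated headers, chunked bodies, and so on):

```python
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Set-Cookie: sessionid=abc123; Path=/\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)


def parse_response_message(raw):
    """Split a raw response into version, code, reason, headers and body."""
    # The first blank line separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = lines[0].split(" ", 2)  # the response line
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Simplification: a repeated header overwrites the earlier one.
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response_message(RAW)
print(version, code, reason)
print(headers)
print(body)
```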
3. Parse web pages
Parsing a web page essentially means doing two things: extracting the links on the page and extracting the resources on the page
(1) Extract links
Extracting links means collecting the links to other web pages that appear on the page being parsed
The crawler then sends requests to these links and repeats the process until the whole site has been crawled
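One way to extract links with only the standard library is to subclass `html.parser.HTMLParser` and collect the `href` of every `<a>` tag. This is a sketch under simple assumptions: the HTML snippet and base URL are made up, `LinkExtractor` and `extract_links` are hypothetical names, and links in other tags or in scripts are ignored:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links are resolved against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = (
    '<a href="/about">About</a> '
    '<a href="post.html">Post</a> '
    '<a href="https://other.example/x">Elsewhere</a>'
)
print(extract_links(page, "https://example.com/blog/"))
```

The returned URLs are the ones the crawler would request next.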
(2) Extract resources
Extracting data is the purpose of a crawler. Common data types are as follows:
- Text: HTML, JSON, etc
- Pictures: JPG, GIF, PNG, etc
- Video: MPEG-1, MPEG-2, MPEG-4, AVI, etc
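A crawler usually learns which kind of resource it received from the Content-Type response header. Content-Type is not covered in the notes above, so treat the sketch below as an assumption about how the three categories could be told apart; `classify_resource` is a made-up helper:

```python
def classify_resource(content_type):
    """Sort a Content-Type header value into text, picture, video or other."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("text/") or media_type == "application/json":
        return "text"
    if media_type.startswith("image/"):
        return "picture"
    if media_type.startswith("video/"):
        return "video"
    return "other"


for value in ("text/html; charset=utf-8", "application/json", "image/png", "video/mp4"):
    print(value, "->", classify_resource(value))
```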
Finally, we can further process the obtained resources to extract valuable information