In this section, we’ll take a closer look at the basics of HTTP and see what happens between typing a URL into the browser and retrieving the web content. Understanding this process will help us grasp the basic principles of crawling.
1. The URI and URL
Here we first learn about URIs and URLs. URI stands for Uniform Resource Identifier, and URL stands for Uniform Resource Locator.
For example, github.com/favicon.ico is the link to GitHub’s icon, and it is both a URL and a URI. There is an icon resource, and we uniquely specify how to access it with a URL/URI: the access protocol (HTTPS), the access path (/, the root directory), and the resource name (favicon.ico). Through this link, the resource can be located on the Internet; such a link is called a URL/URI.
A URL is a subset of a URI: every URL is a URI, but not every URI is a URL. So what kind of URI is not a URL? URIs also include a subclass called URN, which stands for Uniform Resource Name. A URN names a resource without specifying how to locate it. For example, urn:isbn:0451450523 specifies a book’s ISBN, which uniquely identifies the book, but does not tell us where to find a copy. Figure 2-1 shows the relationship between URLs, URNs, and URIs.
Figure 2-1 Relationship between URLs, URNs, and URIs
But URNs are rarely used on the Internet today, so almost all URIs are URLs. Ordinary web links can be called either URLs or URIs; I prefer to call them URLs.
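As a quick illustration, the components of such a link can be pulled apart with Python’s standard `urllib.parse` module (a minimal sketch using the GitHub icon URL from above):

```python
from urllib.parse import urlparse

# Split the GitHub icon link into the parts described above
result = urlparse("https://github.com/favicon.ico")

print(result.scheme)  # access protocol: https
print(result.netloc)  # host: github.com
print(result.path)    # resource path: /favicon.ico
```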
2. Hypertext
Next, let’s look at another concept: hypertext. What we see in the browser is the result of parsing hypertext. The source code of a web page is a series of HTML code containing a variety of tags, such as img for displaying an image and p for specifying a paragraph. After the browser parses these tags, it renders the page we see every day, and this HTML source code is what we call hypertext.
For example, open any page in Chrome, such as the Taobao home page, right-click anywhere, and select “Inspect” (or press F12) to open the browser’s developer tools. In the Elements tab you can see the current page’s source code, and this source code is hypertext, as shown in Figure 2-2.
Figure 2-2 Source code
3. HTTP and HTTPS
Taobao’s home page URL, www.taobao.com/, begins with http or https, which indicates the protocol type used to access the resource. Sometimes we also see URLs that start with ftp, sftp, or smb, which are likewise protocol types. In crawling, the pages we grab are usually served over HTTP or HTTPS, so let’s first look at what these two protocols mean.
HTTP stands for Hyper Text Transfer Protocol. It is used to transfer hypertext data from the network to the local browser, and it ensures that hypertext documents are transmitted efficiently and accurately. HTTP is a specification jointly developed by the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF). Currently, HTTP 1.1 is the most widely used version.
HTTPS stands for Hyper Text Transfer Protocol over Secure Socket Layer. It is an HTTP channel with security as its goal; in short, it is the secure version of HTTP, that is, HTTP with an SSL layer added.
The security of HTTPS is based on SSL. Therefore, the content transmitted through HTTPS is encrypted by SSL. The functions of HTTPS can be divided into two aspects.
- Establish an information security channel to ensure the security of data transmission.
- Verify the authenticity of the website. For an HTTPS website, you can click the lock icon in the browser’s address bar to view the website’s verified identity information, or check the security seal issued by the CA.
Now more and more websites and apps are moving toward HTTPS, for example:
- Apple required that from January 1, 2017, all iOS apps use HTTPS encryption, or they would not be allowed on the App Store.
- Starting with Chrome 56, released in January 2017, Google flags links that are not encrypted with HTTPS with a risk alert, prominently reminding users that “this page is not secure”.
- The official requirements for Tencent’s WeChat mini programs state that the backend must use HTTPS for network communication; domain names and protocols that do not meet the requirement cannot be requested.
For example, if you open Chrome and enter www.12306.cn/, the message “Your connection is not private” may be displayed, as shown in Figure 2-3.
Figure 2-3 12306 page
This is because the CA certificate of 12306 was issued by the Chinese Ministry of Railways, and this certificate is not in the browsers’ list of trusted CA certificates, so certificate verification fails; nevertheless, its data transmission is still encrypted via SSL. If you want to crawl such a site, you need to set an option to ignore the certificate, otherwise you will get an SSL certificate error.
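In Python, for instance, the standard `ssl` module can build a context that skips certificate verification, which is the kind of “ignore the certificate” option mentioned above (a sketch; note that disabling verification also disables the authenticity check that HTTPS normally provides):

```python
import ssl

# Build an SSL context that skips certificate verification
context = ssl.create_default_context()
context.check_hostname = False       # must be disabled before verify_mode
context.verify_mode = ssl.CERT_NONE  # do not verify the certificate chain

# Such a context could then be passed to an opener, e.g.:
# urllib.request.urlopen("https://www.12306.cn/", context=context)
```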
4. HTTP request process
We type a URL into the browser, press Enter, and see the page content. What actually happens is that the browser sends a request to the server hosting the website; the server receives the request, processes and parses it, and returns a response to the browser. The response contains the page’s source code and other content, which the browser parses to render the web page, as shown in Figure 2-4.
Figure 2-4 Model diagram
In this case, the client represents our own PC or mobile browser, and the server is the server where the website to be accessed resides.
To illustrate this process more concretely, here is a demonstration using the Network panel in Chrome’s developer tools, which displays all the network requests and responses that occur while loading the current page.
Open Chrome, right-click the page, and choose Inspect to open the developer tools. Then visit Baidu: type www.baidu.com/ into the address bar and press Enter, and observe what network requests occur. Entries appear one by one in the Network panel; each entry represents one round of sending a request and receiving a response, as shown in Figure 2-5.
Figure 2-5 Network panel
Let’s look at the first network request, www.baidu.com.
The meaning of each column is as follows.
- The first column, Name: the name of the request, usually the last part of the URL.
- The second column, Status: the status code of the response, displayed here as 200, which indicates a normal response. From the status code we can determine whether the request received a normal response after it was sent.
- The third column, Type: the type of the requested document. Here it is document, meaning what was requested this time is an HTML document.
- The fourth column, Initiator: the request source, marking which object or process initiated the request.
- The fifth column, Size: the size of the file or resource downloaded from the server. If the resource was fetched from the cache, this column displays “from cache”.
- The sixth column, Time: the total time from initiating the request to receiving the response.
- The seventh column, Waterfall: a waterfall visualization of the request’s network timing.
Click on the entry to see more details, as shown in Figure 2-6.
Figure 2-6 Details
First, the General section:

- Request URL: the URL of the request.
- Request Method: the request method.
- Status Code: the response status code.
- Remote Address: the address and port of the remote server.
- Referrer Policy: the Referrer policy.
Further down, you can see Response Headers and Request Headers, which represent the response headers and request headers, respectively. The request headers carry much information about the request, such as the browser identification, Cookies, and Host, and are part of the request; the server uses this information to judge whether the request is legitimate and to respond accordingly. The Response Headers shown in the figure are part of the response and contain information such as the server type, document type, and date; after receiving the response, the browser parses its content to render the web page.
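The whole cycle described in this section can be reproduced locally with Python’s standard library: the sketch below starts a throwaway HTTP server on an ephemeral port, sends it one request, and inspects the status code, a response header, and the response body (the handler and page content are invented for illustration):

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Respond with a tiny HTML document
        page = b"<html><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, *args):
        pass  # silence per-request logging

# Serve on an ephemeral port in a background thread
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "browser" side: send a request and read the response
resp = urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/")
body = resp.read()
print(resp.status)                   # 200
print(resp.headers["Content-Type"])  # text/html
print(body)                          # the response body (HTML source)
server.shutdown()
```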
Let’s take a look at what’s involved in the request and response.
5. Request
A Request is sent by the client to the server and consists of four parts: Request Method, Request URL, Request Headers, and Request Body.
(1) Request method
There are two common request methods: GET and POST.
Typing a URL into the browser and pressing Enter initiates a GET request, whose parameters are included directly in the URL. For example, searching for Python on Baidu is a GET request with the link www.baidu.com/s?wd=Python, where the URL contains the request’s parameters and wd is the keyword to search for. POST requests are mostly initiated when a form is submitted. For example, in a login form, clicking the “Login” button after entering a username and password usually initiates a POST request, whose data is transferred in the form rather than reflected in the URL.
The differences between the GET and POST request methods are as follows.

- The parameters of a GET request are contained in the URL, where the data is visible; the URL of a POST request does not carry the data, which is instead transmitted through the form and contained in the request body.
- The data a GET request can carry is limited by the maximum URL length (which depends on the browser and server, typically a few thousand characters), while a POST request has no such limit.

Generally speaking, logging in requires submitting a username and password, which contain sensitive information. If the GET method were used, the password would be exposed in the URL, risking a leak, so it is best to send it via POST. Uploading a large file is also done with POST.
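The difference can be made concrete by constructing both kinds of request with `urllib` from the standard library (using the Baidu search example above; nothing is actually sent over the network):

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({"wd": "Python"})

# GET: the parameters ride along in the URL itself
get_req = Request("https://www.baidu.com/s?" + params, method="GET")
print(get_req.full_url)  # https://www.baidu.com/s?wd=Python

# POST: the same data goes into the request body, not the URL
post_req = Request("https://www.baidu.com/s",
                   data=params.encode("utf-8"), method="POST")
print(post_req.full_url)  # https://www.baidu.com/s  (no parameters)
print(post_req.data)      # b'wd=Python'
```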
Most of the requests we encounter are GET or POST requests. There are also other request methods, such as HEAD, PUT, DELETE, OPTIONS, CONNECT, and TRACE, which are briefly summarized in Table 2-1.
Table 2-1 Other request methods
| Method | Description |
| --- | --- |
| GET | Requests a page and returns the page content |
| HEAD | Similar to GET, except that the response has no concrete body; used to retrieve the headers |
| POST | Mostly used to submit a form or upload a file; the data is contained in the request body |
| PUT | Replaces the contents of the specified document with data sent from the client to the server |
| DELETE | Asks the server to delete the specified page |
| CONNECT | Uses the server as a relay to access other web pages on behalf of the client |
| OPTIONS | Allows the client to view the server’s capabilities |
| TRACE | Echoes back the request received by the server, for testing or diagnosis |
Table reference: www.runoob.com/http/http-m… .
(2) Request URL

The request URL, i.e., the Uniform Resource Locator, uniquely identifies the resource we want to request.
(3) Request headers

The request headers describe additional information for the server to use, such as Cookie, Referer, and User-Agent. Some commonly used headers are briefly described below.
- Accept: a request header field that specifies what content types the client can accept.
- Accept-Language: specifies the languages acceptable to the client.
- Accept-Encoding: specifies the content encodings acceptable to the client.
- Host: specifies the host name and port number of the server hosting the requested resource, i.e., the location of the original server or gateway for the requested URL. Since HTTP 1.1, every request must include this header.
- Cookie (often used in the plural, Cookies): data stored locally by a website to identify the user and track the session. Its main function is to maintain the current session. For example, after we log in to a website with a username and password, the server uses a session to record the login state; later, when we refresh the page or request other pages of the site, we find we are still logged in, and that is thanks to Cookies. The Cookies carry the information that identifies our session on the server, so every time the browser requests a page from this site it attaches the Cookies to the request headers; the server uses them to identify us, determines that we are in the logged-in state, and therefore returns content that can only be seen after logging in.
- Referer: identifies the page from which the request was sent. The server can use this information for purposes such as source statistics and hotlink protection.
- User-Agent (UA for short): a special string header that lets the server identify the client’s operating system and version, browser and version, and so on. Adding this header to a crawler can disguise it as a browser; without it, the request is likely to be recognized as coming from a crawler.
- Content-Type: also called the Internet media type or MIME type. In an HTTP message, it indicates the media type of the content. For example, text/html means HTML format, image/gif a GIF image, and application/json the JSON type. You can check the mapping table at tool.oschina.net/commons for more information.
Therefore, the request headers are an important part of the request, and in most cases they need to be set when writing a crawler.
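As an illustration, headers like those above can be attached to a request with `urllib` (the header values are examples; no request is sent here). Note that `urllib` normalizes stored header names, so `User-Agent` is looked up as `User-agent`:

```python
from urllib.request import Request

# Attach common request headers; the values below are examples
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html",
    "Accept-Language": "en-US,en;q=0.9",
}
req = Request("https://www.example.com/", headers=headers)

print(req.get_header("User-agent"))  # the UA string set above
```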
(4) Request body

The request body typically carries the form data of a POST request; for a GET request, the request body is empty.
For example, the request and response captured when I log into GitHub is shown in Figure 2-7.
Figure 2-7 Details
Before logging in, we fill in the username and password, which are submitted to the server as form data. Note that the Content-Type specified in the Request Headers is application/x-www-form-urlencoded; the data is submitted as a form only when the Content-Type is set to application/x-www-form-urlencoded. Alternatively, we can set the Content-Type to application/json to submit JSON data, or to multipart/form-data to upload files.
Table 2-2 lists the relationship between the Content-Type and POST data submission methods.
Table 2-2 Relationship between Content-Type and POST data submission methods
| Content-Type | Data submission method |
| --- | --- |
| application/x-www-form-urlencoded | Form data |
| multipart/form-data | Form file upload |
| application/json | Serialized JSON data |
| text/xml | XML data |
When constructing a POST request in a crawler, you need to use the correct Content-Type and understand which Content-Type each parameter setting of the various request libraries produces; otherwise the POST submission may not receive a normal response.
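A sketch of the pairing between Content-Type and body encoding, using `urllib` request objects (the login URL and credentials are invented; nothing is sent):

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

payload = {"username": "alice", "password": "secret"}  # example data

# Form submission: urlencoded body with the matching Content-Type
form_req = Request(
    "https://www.example.com/login",
    data=urlencode(payload).encode("utf-8"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# JSON submission: serialized JSON body with the matching Content-Type
json_req = Request(
    "https://www.example.com/login",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

print(form_req.data)  # b'username=alice&password=secret'
print(json_req.data)  # b'{"username": "alice", "password": "secret"}'
```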
6. Response
A Response is returned by the server to the client and can be divided into Response Status Code, Response Headers, and Response Body.
(1) Response status code
The response status code indicates the server’s response status. For example, 200 indicates that the server responded normally, 404 that the page was not found, and 500 that an internal server error occurred. In a crawler, we can judge the server’s response from the status code: if it is 200, the data was returned successfully and we proceed with further processing; otherwise, the response is ignored. Table 2-3 lists common status codes and their meanings.

Table 2-3 Common status codes and their meanings
| Status code | Meaning | Details |
| --- | --- | --- |
| 100 | Continue | The requester should continue the request. The server has received part of the request and is waiting for the rest |
| 101 | Switching Protocols | The requester has asked the server to switch protocols, and the server has confirmed and is ready to switch |
| 200 | OK | The server has successfully processed the request |
| 201 | Created | The request succeeded and the server created a new resource |
| 202 | Accepted | The server has accepted the request but has not yet processed it |
| 203 | Non-Authoritative Information | The server successfully processed the request, but the returned information may come from another source |
| 204 | No Content | The server successfully processed the request but returned no content |
| 205 | Reset Content | The server successfully processed the request, and the content should be reset |
| 206 | Partial Content | The server successfully processed part of the request |
| 300 | Multiple Choices | The server can perform a variety of actions for the request |
| 301 | Moved Permanently | The requested page has been permanently moved to a new location: a permanent redirect |
| 302 | Found | The requested page is temporarily redirected to another page |
| 303 | See Other | If the original request was a POST, the redirect target should be retrieved with GET |
| 304 | Not Modified | The page has not been modified since the last request; the cached resource continues to be used |
| 305 | Use Proxy | The requester should access the page through a proxy |
| 307 | Temporary Redirect | The requested resource temporarily responds from a different location |
| 400 | Bad Request | The server could not parse the request |
| 401 | Unauthorized | The request was not authenticated or failed authentication |
| 403 | Forbidden | The server refused the request |
| 404 | Not Found | The server could not find the requested page |
| 405 | Method Not Allowed | The server has disabled the method specified in the request |
| 406 | Not Acceptable | The requested page cannot respond with the requested content characteristics |
| 407 | Proxy Authentication Required | The requester needs to authenticate with the proxy |
| 408 | Request Timeout | The server timed out waiting for the request |
| 409 | Conflict | The server encountered a conflict while completing the request |
| 410 | Gone | The requested resource has been permanently removed |
| 411 | Length Required | The server will not accept a request without a valid Content-Length header field |
| 412 | Precondition Failed | The server did not meet a precondition the requester set in the request |
| 413 | Request Entity Too Large | The request entity is too large for the server to handle |
| 414 | Request-URI Too Long | The requested URL is too long for the server to process |
| 415 | Unsupported Media Type | The request format is not supported by the requested page |
| 416 | Requested Range Not Satisfiable | The page cannot provide the requested range |
| 417 | Expectation Failed | The server did not meet the requirement of the Expect request header field |
| 500 | Internal Server Error | The server encountered an error and could not complete the request |
| 501 | Not Implemented | The server lacks the capability to complete the request |
| 502 | Bad Gateway | The server, acting as a gateway or proxy, received an invalid response from the upstream server |
| 503 | Service Unavailable | The server is currently unavailable |
| 504 | Gateway Timeout | The server, acting as a gateway or proxy, did not receive a response from the upstream server in time |
| 505 | HTTP Version Not Supported | The server does not support the HTTP version used in the request |
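Python’s standard library already knows these codes and their reason phrases, which makes the “only parse on 200” pattern easy to sketch (`should_parse` is a made-up helper name for illustration):

```python
from http import HTTPStatus

# Standard status codes and reason phrases live in the stdlib
print(HTTPStatus.OK.value, HTTPStatus.OK.phrase)                # 200 OK
print(HTTPStatus.NOT_FOUND.value, HTTPStatus.NOT_FOUND.phrase)  # 404 Not Found

def should_parse(status_code):
    # A crawler typically processes the body only on a normal response
    return status_code == HTTPStatus.OK

print(should_parse(200))  # True
print(should_parse(404))  # False
```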
(2) Response headers

The response headers contain the server’s reply information for the request, such as Content-Type, Server, and Set-Cookie. Some commonly used headers are briefly described below.
- Date: the time when the response was generated.
- Last-Modified: the time when the resource was last modified.
- Content-Encoding: the encoding of the response content.
- Server: information about the server, such as its name and version number.
- Content-Type: the type of the returned data. For example, text/html means an HTML document is returned, application/x-javascript a JavaScript file, and image/jpeg an image.
- Set-Cookie: sets Cookies. This header tells the browser to store its content in Cookies, and the next request to the site will carry those Cookies.
- Expires: specifies an expiration time for the response, allowing the proxy server or browser to cache the loaded content. When the content is accessed again within that time, it can be loaded directly from the cache, which reduces the server’s load and shortens load times.
(3) Response body

This is the most important part: the body data of the response lives here. For example, when requesting a web page, the response body is the page’s HTML code; when requesting an image, the response body is the image’s binary data. After a crawler requests a page, the content to parse is the response body, as shown in Figure 2-8.
Figure 2-8 Response body content
Click Preview in the browser’s developer tools and you will see the web page’s source code, i.e., the content of the response body, which is the target of our parsing.
When writing crawlers, we mainly obtain the page’s source code or JSON data from the response body, and then extract the needed content from it.
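The two cases above (HTML text vs. image binary) come down to how the body bytes are handled; a minimal sketch with made-up sample data:

```python
# A response body arrives as raw bytes. For an HTML page, it is decoded
# into text (using the charset from Content-Type; UTF-8 assumed here).
raw_body = b"<html><body>\xe4\xbd\xa0\xe5\xa5\xbd</body></html>"  # sample bytes

html = raw_body.decode("utf-8")  # text to hand to a parser
print(html)  # <html><body>你好</body></html>

# An image's body stays binary and would be written to a file as-is
image_body = b"\x89PNG\r\n\x1a\n"  # the magic bytes that start a PNG file
print(image_body[:4])
```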
In this section, we covered the basics of HTTP and the rough request-and-response process behind accessing a web page. This material comes up often when analyzing web requests, so it is worth understanding well.