Author: xiaoyu

Python Data Science

Zhihu: zhuanlan.zhihu.com/pypcfx


In the process of learning to write crawlers, HTTP is a word you are surely no stranger to; it never seems to leave our sight. Each time we are forced to open the developer tools to look at the request headers, the response headers, and the various fields inside them, then fill that information into modules someone else has already wrapped up and type a few lines of code. For a simple crawling task we may not care what is going on underneath, but when a real problem comes up we may not know where to start.

Knowing and deeply understanding HTTP is very helpful when implementing a crawler. To give you a better understanding of HTTP in crawlers, the blogger will cover it in two parts, “Basics” and “Advanced”. This is the Basics part, which elaborates on the following topics.

  • What is HTTP
  • A complete HTTP request process
  • HTTP request packet

What is HTTP?

< Introduction to HTTP >

Authoritative answer quoted from Baidu Baike:

HyperText Transfer Protocol (HTTP) is the most widely used network protocol on the Internet. All WWW files must comply with this standard. HTTP was originally designed to provide a way to publish and receive HTML pages. In the 1960s, the American Ted Nelson conceived of a method of processing text information by computer and called it hypertext, which became the conceptual foundation of the HTTP standard architecture. Joint work coordinated between the World Wide Web Consortium and the Internet Engineering Task Force eventually resulted in the release of a series of RFCs, among which the famous RFC 2616 defines HTTP/1.1.

The HTTP protocol is used to transfer hypertext from the WWW server to the local browser. It can make browsers more efficient and reduce network traffic. It not only ensures that the computer transfers the hypertext document correctly and quickly, but also determines which parts of the document to transfer and which parts of the content to display first (e.g. text before graphics).

< HTTP model >

HTTP adopts the browser/server (B/S) request/response model, in which the browser is always the initiator of an HTTP request and the server is always the responder.

This means the server cannot actively push messages to the client; it only responds after the browser client initiates a request.

< Where HTTP sits >

HTTP is an application-layer protocol and the most direct way to request information from a server. For example, the urllib and Requests modules used in crawlers encapsulate the HTTP protocol and act as HTTP clients to download information such as blog posts, pictures, and videos.
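
As a quick illustration, here is a minimal sketch of fetching a page with the standard-library urllib; the URL is only a placeholder.

```python
from urllib.request import urlopen

# urllib speaks HTTP for us: it builds the request, sends it over TCP,
# and parses the response into a convenient object.
with urlopen("http://www.example.com") as resp:
    print(resp.status)                    # HTTP status code, e.g. 200
    html = resp.read().decode("utf-8")    # the requested document
```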

However, HTTP does not work on its own; its requests ride on lower-level protocols. For example, in the TCP/IP protocol stack, an HTTP request can be sent to the server only after the TCP three-way handshake succeeds. For HTTPS, a TLS/SSL security layer is also required.

A complete HTTP request process

Since the HTTP protocol needs to be built on top of other underlying protocols, let’s take a look at what a complete HTTP request looks like.

When we click a link or enter a URL, the whole HTTP request process begins, and the following steps are carried out to obtain the final content. Here we briefly describe the first four steps, which are enough to understand HTTP.

  • Domain name resolution: the local DNS caches are searched first; if no record is found, a resolution request is sent to a DNS server to obtain the IP address.
  • Establishing a TCP connection: once the IP address is obtained, a socket connection is created via the TCP three-way handshake; the default port is 80.
  • Sending the HTTP request: after the TCP connection succeeds, the browser/crawler can send an HTTP request packet (request line, request headers, and request body) to the server.
  • Server response: the server responds with an HTTP response packet (status code 200 if successful) and the requested HTML. Both the request and the response carry information in a specific format, which we will explain next. A minimal code sketch of these four steps follows the list.
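
Here is that sketch, using only Python's standard socket module; www.example.com is just a placeholder host, and any site serving plain HTTP on port 80 would do.

```python
import socket

host = "www.example.com"

# 1. Domain name resolution: ask the resolver (cache or DNS server) for the IP.
ip = socket.gethostbyname(host)

# 2. Establish a TCP connection (three-way handshake), default port 80.
sock = socket.create_connection((ip, 80), timeout=5)

# 3. Send an HTTP request packet: request line + headers + blank line.
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {host}\r\n"
    "Connection: close\r\n"
    "\r\n"
)
sock.sendall(request.encode("ascii"))

# 4. Read the server's response (status line, headers, body) and close.
response = b""
while chunk := sock.recv(4096):
    response += chunk
sock.close()

print(response.split(b"\r\n", 1)[0])  # e.g. b'HTTP/1.1 200 OK'
```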

Every HTTP response carries a status code, from which you can tell how the request was handled. The status codes are as follows:

1xx (informational): the request has been received and processing continues.

  • 100 — Continue: the client should continue with its request
  • 101 — Switching Protocols: the server will switch protocols as the client requested

2xx (success): the request was received, understood, and processed.

  • 200 — OK: the request succeeded
  • 202 — Accepted: the request was accepted for processing, but processing is not complete
  • 203 — Non-Authoritative Information: the returned information is uncertain or incomplete
  • 204 — No Content: the request was received, but the response carries no body
  • 205 — Reset Content: the server completed the request; the user agent must reset the currently viewed document
  • 206 — Partial Content: the server fulfilled part of the client's GET request

3xx (redirection): further action must be taken to complete the request.

  • 300 — Multiple Choices: the requested resource is available in several places
  • 301 — Moved Permanently: the resource has been assigned a new permanent URL
  • 302 — Found: the requested data temporarily resides at another address
  • 303 — See Other: the client is advised to access another URL or use another access method
  • 304 — Not Modified: the resource has not changed since the client's cached copy (conditional GET)
  • 305 — Use Proxy: the requested resource must be obtained through the proxy specified by the server
  • 306 — (Unused): a code used in an earlier version of HTTP, no longer used in the current version

4xx (client error): the request contains a syntax error or cannot be fulfilled.

  • 400 — Bad Request: the request has a syntax error
  • 401 — Unauthorized: authentication is required
  • 402 — Payment Required: reserved (ChargeTo header response)
  • 403 — Forbidden: the server refuses to fulfil the request
  • 404 — Not Found: no matching file, query, or URL was found
  • 405 — Method Not Allowed: the method in the request line is not allowed for this resource
  • 406 — Not Acceptable: the resource cannot generate a response matching the Accept headers sent
  • 407 — Proxy Authentication Required: the client must first authenticate with the proxy server
  • 408 — Request Timeout: the client did not complete the request within the allowed time
  • 409 — Conflict: the request cannot be completed given the current state of the resource
  • 410 — Gone: the server no longer has the resource and knows no forwarding address
  • 411 — Length Required: the server rejects the request because it lacks a Content-Length
  • 412 — Precondition Failed: one or more request header preconditions evaluated to false
  • 413 — Payload Too Large: the requested entity is larger than the server allows
  • 416 — Range Not Satisfiable: the Range request header cannot be satisfied for the current resource, and the request has no If-Range header
  • 417 — Expectation Failed: the server (or, for a proxy, the next-hop server) cannot meet the value given in the Expect request header

5xx (server error): the server failed to fulfil an apparently valid request.

  • 500 — Internal Server Error
  • 501 — Not Implemented
  • 502 — Bad Gateway
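
In a crawler you rarely need the whole table; you usually just check the code on the response object. A minimal sketch with the Requests library (the URL is only a placeholder):

```python
import requests

# Requests follows redirects automatically, so we mostly see 2xx/4xx/5xx here.
resp = requests.get("https://www.example.com", timeout=5)

print(resp.status_code)      # e.g. 200
if resp.ok:                  # any code below 400
    html = resp.text
else:
    resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
```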

HTTP request packet

Now that you have a general picture of the HTTP request process, let's look at the HTTP request packet in detail. The packet consists of a request line, request headers, and a request body.

Let's look at the content of an HTTP request message, captured with the developer tools while requesting www.baidu.com/, and compare it against the standard format above.

The format of the captured request message is basically the same as described above, which is exactly what we expect. So let's go through each of these pieces of information one by one.

The request line


GET is one of the HTTP request methods. HTTP/1.1 defines eight methods for interacting with the server: GET, POST, HEAD, PUT, DELETE, OPTIONS, TRACE, and CONNECT. The most commonly used are GET and POST.

  • HEAD: asks the server for the same response as a GET request, but without the response body
  • GET: requests the resource identified by the URL (in crawlers, fetching a specific URL)
  • POST: submits data such as a form (e.g. simulating a login in a crawler)
  • PUT: uploads a file/resource to the given URL (not supported by HTML forms in browsers)
  • DELETE: deletes the resource identified by the URL
  • OPTIONS: returns the HTTP request methods the server supports for a specific resource
  • TRACE: echoes back the request received by the server, used for testing or diagnostics
  • CONNECT: reserved for proxies that can tunnel the connection

In the request line, the method GET is followed by the requested URL (/ in this case) and the protocol version HTTP/1.1, each separated by a space; don't forget the spaces.
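
As a quick sketch of the two most common methods with the Requests library (httpbin.org is just a public echo service used as a placeholder, not a real login endpoint):

```python
import requests

# GET: fetch the resource identified by the URL (query parameters go in the URL).
resp = requests.get("https://httpbin.org/get", params={"q": "crawler"})
print(resp.status_code)

# POST: submit form data in the request body, e.g. when simulating a login.
resp = requests.post(
    "https://httpbin.org/post",
    data={"username": "demo", "password": "demo"},
)
print(resp.status_code)
```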

Request header

HTTP header fields fall into four groups: general headers, request headers, response headers, and entity headers. Since we often submit custom headers to disguise a crawler, we will focus on the request headers here.

Request headers are unique to the request message and pass some additional information to the server. For example, through the Accept field the client tells the server what types of data it accepts. We can treat these fields as key-value pairs.
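
In code these key-value pairs are usually written as a plain dictionary and passed along with the request; a minimal sketch (the field values follow the captured request, the URL is a placeholder):

```python
import requests

# Request header fields as key-value pairs.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
}
resp = requests.get("https://www.baidu.com", headers=headers, timeout=5)
```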

Now let’s see what these fields mean.

Accept

Content: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Meaning: tells the server which MIME types the client accepts.

Accept-Encoding

Content: gzip, deflate, br
Meaning: the content encodings (compression methods) the client accepts. Note: generally do not add this header in a crawler; the blogger did not understand it at first, copied everything over, and as a result got stuck on it for a long time.

Accept-Language

Content: zh-CN,zh;q=0.9
Meaning: indicates the languages the client can accept; if none is specified, any language is acceptable.

Connection

Content: keep-alive
Meaning: tells the server that the client wants a persistent connection (HTTP/1.1 uses persistent connections by default).
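
With the Requests library a persistent connection is reused across requests through a Session object; a small sketch (the URL is a placeholder):

```python
import requests

# A Session keeps the underlying TCP connection alive (Connection: keep-alive),
# so several requests to the same host reuse it instead of reconnecting.
with requests.Session() as session:
    for path in ("/", "/robots.txt"):
        resp = session.get("https://www.example.com" + path, timeout=5)
        print(path, resp.status_code)
```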

Host

Content: www.baidu.com
Meaning: specifies the domain name (or IP address) and port number of the web server the client wants to access.

Cache-Control

Content: max-age=0

Cache-control is the most important rule. This field is used to specify the instructions that all caching mechanisms must obey throughout the request/response chain. These directives specify actions to prevent caching from adversely interfering with a request or response. These instructions usually override the default cache algorithm. Cache instructions are one-way, meaning that the presence of one instruction in the request does not mean that the same instruction will be present in the response.

Web page caching is controlled by the Cache-Control field in the HTTP message header. Common values include private, no-cache, max-age, and must-revalidate; the default is private.

However, Cache-Control is not exactly the same for HTTP requests and responses.

Common Cache-Control values in a request are no-cache, no-store, max-age, max-stale, min-fresh, and only-if-cached.

Common Cache-Control values in a response are public, private, no-cache, no-store, no-transform, must-revalidate, proxy-revalidate, and max-age.

Here we focus on the common Cache-Control values in requests.

<1> max-age<=0: in this example max-age is 0, which means every request goes back to the server. The server uses Last-Modified (checked via If-Modified-Since) to see whether the file has changed; if it has, status code 200 and the latest file are returned, otherwise 304 is returned and the cached copy is used.

<2> max-age>0: the response is fetched directly from the browser cache.

<3> no-cache: the response is not taken from the browser cache; the request is forced to go to the server. This ensures the client receives the most authoritative response.

<4> no-store: nothing is cached at all, neither in the cache nor in temporary Internet files.
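
As a small sketch with the Requests library, a crawler can send its own Cache-Control directive to ask every cache along the way for a fresh copy (the URL is a placeholder):

```python
import requests

# Ask intermediate caches to revalidate instead of serving a stored copy.
headers = {"Cache-Control": "no-cache"}
resp = requests.get("https://www.example.com", headers=headers, timeout=5)
print(resp.status_code)
```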

Upgrade-Insecure-Requests

Content: 1
Meaning: indicates that the browser/crawler can handle HTTPS and asks for requests to be upgraded from HTTP to HTTPS automatically.

User-Agent

Content: Mozilla/5.0 (Windows NT 6.1; WOW64) .. Safari/537.36
Meaning: identifies the browser making the request, i.e. the browser type and the operating system it runs on. This is the field crawlers use most: by setting it, a crawler can disguise its requests as coming from a real browser.
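
A minimal sketch of that disguise with the Requests library (the User-Agent string and URL are only examples):

```python
import requests

# Pretend to be a desktop Chrome browser; many sites answer differently
# (or block) requests whose User-Agent looks like a script.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}
resp = requests.get("https://www.example.com", headers=headers, timeout=5)
print(resp.request.headers["User-Agent"])
```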

Cookie

Cookies are used to maintain session state: they are written by the server (via Set-Cookie in a response), stored by the client, and sent back in the Cookie header on subsequent requests so the server can recognize the session.
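
With the Requests library, a Session object stores cookies from responses and sends them back automatically; a minimal sketch (the URLs are placeholders):

```python
import requests

with requests.Session() as session:
    # The first response may carry Set-Cookie headers; the session stores them.
    session.get("https://www.example.com/", timeout=5)
    print(session.cookies.get_dict())

    # Later requests automatically include the stored cookies
    # in the Cookie request header.
    session.get("https://www.example.com/next-page", timeout=5)
```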

That's all the fields that appear in this example. Of course, there are other commonly used header fields, which are also explained below.

Other request header fields

Referer

Indicates the page (URL) from which the client reached the requested page. In crawlers we usually just set it to the link of the requested page.

Accept-Charset

The character sets the browser accepts, e.g. UTF-8, GBK.

If-Modified-Since

Content: Thu, 10 Apr 2008 09:14:42 GMT
Meaning: asks the server to return the resource only if it has been modified since the given time; otherwise the server can reply with 304 Not Modified.

Pragma

Meaning: the Pragma header field is used to carry implementation-specific directives. The most common is Pragma: no-cache; in HTTP/1.1 it has the same meaning as Cache-Control: no-cache.

Range

Meaning: tells the server which part (byte range) of the resource the client wants to fetch. Example: Range: bytes=0-1173546
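
A small sketch of a partial download using the Range header with the Requests library (the URL and byte range are placeholders):

```python
import requests

# Ask the server for only the first 1024 bytes of the resource.
headers = {"Range": "bytes=0-1023"}
resp = requests.get("https://www.example.com/bigfile.zip", headers=headers, timeout=5)

# A server that honors the range replies 206 Partial Content;
# one that ignores it replies 200 with the whole body.
print(resp.status_code, len(resp.content))
```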

Conclusion

This article introduces the basic concepts of HTTP, including the following:

  • What is HTTP

  • HTTP model, role, and positioning

  • A complete HTTP request process

  • HTTP request header information

  • Common fields in HTTP request headers

The next article will share some advanced content about HTTP, including the following:

  • Cookie
  • Session

Finally, you are welcome to leave me a message so that we can discuss and learn crawler techniques together. The blogger is also constantly learning and will keep sharing along the way.
