Knowledge points

  • Understand the concepts and default ports of HTTP and HTTPS

  • Know the request and response headers that matter to a crawler

  • Understand common response status codes

  • Understand how a browser’s requests differ from a crawler’s

Everyone knows HTTP as an application layer protocol, but what does it have to do with crawlers? Read on:

1. Concepts and differences between HTTP and HTTPS

HTTPS is more secure than HTTP, at the cost of some performance

  • HTTP: Hypertext Transfer Protocol. The default port number is 80

    • Hypertext: more than plain text; it also includes pictures, audio, video and other files

    • Transfer protocol: a shared convention for transmitting hypertext content that has been converted to a string

  • HTTPS: HTTP + SSL (Secure Sockets Layer), that is, the Hypertext Transfer Protocol with a Secure Sockets Layer. The default port number is 443

    • SSL (today superseded by TLS) encrypts the transmitted content: the whole HTTP message, headers as well as the body of the request or response

  • To see what the HTTP protocol looks like, open a URL in a browser, right-click and choose Inspect, open the Network tab, and click one of the requests
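
As a small illustration of the default ports, the sketch below works out which port a URL would connect to. The function name `effective_port` and the example.com URLs are made up for this example; the port constants come from Python's standard `http.client` module.

```python
from http.client import HTTP_PORT, HTTPS_PORT  # 80 and 443
from urllib.parse import urlsplit

# Default port for each scheme discussed above.
DEFAULT_PORTS = {"http": HTTP_PORT, "https": HTTPS_PORT}


def effective_port(url: str) -> int:
    """Return the port written in the URL, or the scheme's default port."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # Raises KeyError for schemes other than http/https (not handled here).
    return DEFAULT_PORTS[parts.scheme]


print(effective_port("http://example.com/index.html"))
print(effective_port("https://example.com/"))
print(effective_port("http://example.com:8080/"))
```

No request is sent here; the URL is only parsed.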

Knowledge point: understand the concepts and default ports of HTTP and HTTPS

2. Request headers and response headers that crawlers pay special attention to

2.1 Request header fields of particular concern

An HTTP request consists of a request line, header fields and an optional body. The crawler pays special attention to the following request header fields:

  • Content-Type

  • Host (host name and port number)

  • Connection (connection type)

  • Upgrade-Insecure-Requests (ask to be upgraded to HTTPS requests)

  • User-Agent (browser name)

  • Referer (the page the request came from)

  • Cookie (cookies)

  • Authorization (carries the credentials for resources that require authentication over HTTP, such as the JWT authentication covered in the earlier Web course)

The most common of these request headers are the ones servers most often check to identify crawlers, so they matter more than the rest. That does not make the others unimportant: some site operators or developers deliberately check more unusual request headers to screen out crawlers.
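
A minimal sketch of setting such headers on a crawler request, using the standard library's `urllib.request.Request`. The helper name `build_request`, the User-Agent string, the URLs and the cookie value are all placeholders invented for illustration. The request is only constructed, not sent.

```python
from urllib.request import Request

# Placeholder browser-like User-Agent string (an assumption, not a required value).
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, referer=None, cookie=None):
    """Build (but do not send) a request carrying browser-like headers."""
    headers = {"User-Agent": BROWSER_UA}
    if referer:
        headers["Referer"] = referer
    if cookie:
        headers["Cookie"] = cookie
    return Request(url, headers=headers)


req = build_request(
    "https://example.com/list?page=2",
    referer="https://example.com/list?page=1",
    cookie="sessionid=abc123",
)
# urllib stores header names capitalized, so User-Agent is listed as "User-agent".
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Passing the result to `urllib.request.urlopen` would actually send it; which headers a given site checks has to be found out by experiment.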

2.2 Response header fields of special concern

An HTTP response consists of a status line, header fields and a body. The crawler focuses on only one response header field:

  • Set-Cookie (the server asks the user’s browser to store a cookie)
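
To show how Set-Cookie relates to the Cookie request header, here is a rough sketch that turns Set-Cookie values into the Cookie header a crawler would send back. The helper name and the cookie values are invented for the example, and it deliberately ignores attributes such as Path, Domain and expiry that a real cookie jar would honor.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Combine Set-Cookie header values into one Cookie request header value."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # parses "name=value; attr=..." into the jar
    # Only name=value pairs go back to the server; attributes are not sent.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Hypothetical Set-Cookie values as a server might return them.
set_cookie_values = [
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark; Path=/",
]
print(cookie_header_from_set_cookie(set_cookie_values))
```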

Knowledge point: know the request and response headers that the crawler focuses on

3. Common response status codes

  • 200: success

  • 302: temporary redirect. The new URL is given in the Location header of the response

  • 303: after a POST, the browser is redirected to a new URL, which it fetches with GET

  • 307: the browser is redirected to a new URL and repeats the request with the same method (a GET stays a GET)

  • 403: forbidden. The server understands the client’s request but refuses to process it (no permission)

  • 404: The page could not be found

  • 500: Internal server error

  • 503: service unavailable. The server cannot respond because of maintenance or overload, and the response may carry a Retry-After header. A crawler that requests a URL too frequently may cause the server to start ignoring its requests and eventually return a 503 status code
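
The codes above can be looked up with Python's standard `http.HTTPStatus` enum. The sketch below also encodes the 303 versus 307 difference; `method_after_redirect` is a made-up helper name that covers only those two codes.

```python
from http import HTTPStatus


def method_after_redirect(status: int, method: str) -> str:
    """Return the HTTP method used for the follow-up request after a redirect."""
    if status == HTTPStatus.SEE_OTHER:  # 303: follow-up is a retrieval request
        return "HEAD" if method == "HEAD" else "GET"
    if status == HTTPStatus.TEMPORARY_REDIRECT:  # 307: method is preserved
        return method
    raise ValueError(f"not handled in this sketch: {status}")


# Print the standard reason phrase for each code in the list above.
for code in (200, 302, 303, 307, 403, 404, 500, 503):
    print(code, HTTPStatus(code).phrase)

print(method_after_redirect(303, "POST"))
print(method_after_redirect(307, "POST"))
```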

When we studied web development we learned that status codes are the server’s feedback to the client, and that they should reflect the real situation. With crawlers, however, a site’s developers or operations staff may tamper with status codes to keep crawlers from getting data easily. For example, the server may have recognized that you are a crawler but, to keep you off guard, still return status code 200 with no data in the response body.

No status code can be fully trusted; what matters is whether the captured response actually contains the data
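
One simple way to act on this is to judge a response by its content instead of its status code. In the sketch below the function name, the sample HTML and the `class="price"` marker are all hypothetical; in practice the marker would be something you know appears in the real data.

```python
def response_has_data(body: str, marker: str) -> bool:
    """Judge a response by its content, not by its status code."""
    return marker in body


# Two made-up responses that both claim status 200.
real_page = '<ul><li class="price">19.99</li></ul>'
decoy_page = "<html><body></body></html>"

for status, body in ((200, real_page), (200, decoy_page)):
    print(status, response_has_data(body, 'class="price"'))
```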

Knowledge point: understand common response status codes

4. How the browser works

After reviewing the HTTP protocol, let’s look at how the browser sends HTTP requests

4.1 HTTP Request Process

  1. After resolving the domain name to an IP address, the browser sends a request to the URL in the address bar and receives a response

  2. The returned response content (HTML) contains URLs for CSS, JS and images, as well as ajax code. The browser sends further requests in the order they appear in the response content and receives the corresponding responses

  3. Each time the browser receives a response, it adds (loads) it to the displayed result. JS, CSS and other content modify the page, and JS can also send new requests and receive their responses

  4. The whole process, from receiving the first response and displaying it to receiving all the responses and adding to or modifying the displayed result, is called browser rendering
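
Step 2 can be made concrete with a small sketch that lists the extra URLs a browser would go on to request after receiving one HTML response. The class name `SubresourceCollector`, the sample HTML and the example.com URLs are invented for illustration; only `link`, `script` and `img` tags are considered, and ajax requests made by JS are not detected.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubresourceCollector(HTMLParser):
    """Collect URLs of CSS, JS and image resources referenced by an HTML page."""

    # Which attribute holds the URL for each tag of interest.
    URL_ATTRS = {"link": "href", "script": "src", "img": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs against the page's own URL.
                self.urls.append(urljoin(self.base_url, value))


html = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="app.js"></script>
</head><body><img src="https://cdn.example.com/logo.png"></body></html>"""

collector = SubresourceCollector("https://example.com/news/index.html")
collector.feed(html)
for url in collector.urls:
    print(url)
```

A crawler that fetches only the page URL receives the HTML above and none of the three listed resources.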

4.2 Note

A crawler, by contrast, only requests a single URL and gets the one corresponding response (whose content may be HTML, CSS, JS, an image, and so on).

In many cases the page rendered by the browser differs from the page the crawler receives, because the crawler has no rendering ability (later in the course we will use other tools or packages to help the crawler render the response content).

  • What the browser finally displays is the combined rendering of multiple responses to multiple requests sent to multiple URLs

  • Therefore, in a crawler, data extraction must be based on the response to the specific URL that the crawler requested

Knowledge point: understand that what a browser displays is rendered from multiple responses to multiple requests, whereas a crawler gets one response to one request


I am white and white I, a programmer who loves sharing knowledge ❤️
