"Want to learn crawler must see series" master HTTP and HTTPS

Knowledge points

Understand the concepts and default ports of HTTP and HTTPS
Know the request and response headers that matter to a crawler
Understand common response status codes
Understand the difference between what a browser fetches and what a crawler fetches

When HTTP comes up, most people think of it as an application-layer protocol. So what does HTTP have to do with crawlers? Read on:

1. Concepts and differences between HTTP and HTTPS

HTTPS is more secure than HTTP, but it performs slightly worse because of the extra encryption work

HTTP: Hypertext Transfer Protocol. The default port number is 80
- Hypertext: more than plain text; it also includes images, audio, video, and other files
- Transfer protocol: an agreed convention for transmitting hypertext content that has been converted to strings
HTTPS: HTTP + SSL (Secure Sockets Layer), that is, the Hypertext Transfer Protocol with a secure sockets layer. The default port number is 443
- SSL (today its successor, TLS) encrypts the transmitted content: the headers as well as the body of each request and response
To see what HTTP messages look like, open a URL in a browser, right-click and choose Inspect, open the Network tab, and click one of the listed requests
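
The default ports above can be illustrated with a minimal sketch. The helper name `effective_port` and the `example.com` URLs are hypothetical and chosen only for illustration; the sketch handles just the two schemes discussed here.

```python
from urllib.parse import urlsplit

# Default ports named in the text above.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the port a client would connect to for the given URL."""
    parts = urlsplit(url)
    # An explicit port in the URL (e.g. :8080) overrides the scheme default.
    if parts.port is not None:
        return parts.port
    # Only http and https are covered; other schemes raise KeyError.
    return DEFAULT_PORTS[parts.scheme]

print(effective_port("http://example.com/index.html"))
print(effective_port("https://example.com/"))
print(effective_port("http://example.com:8080/"))
```

A crawler rarely needs to compute the port itself, since HTTP libraries do this internally, but the sketch shows why `http://` and `https://` URLs for the same host reach different ports.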

Knowledge: understand HTTP and HTTPS concepts and default ports

2. Request headers and response headers that crawlers pay special attention to

2.1 Request header fields of particular concern

An HTTP request has the form seen in the Network tab of the browser's developer tools; a crawler pays special attention to the following request header fields

Content-Type
Host (host and port number)
Connection (connection type)
Upgrade-Insecure-Requests (the client prefers to be upgraded to HTTPS)
User-Agent (identifies the browser)
Referer (the page the request came from)
Cookie (cookies)
Authorization (carries the credentials for resources that require authentication, for example the JWT authentication covered in the earlier Web course)

The request headers in bold are the common ones. Servers use them most often to identify crawlers, so they matter more than the others. That does not make the remaining headers unimportant, though: some site operators or developers deliberately check less common request headers to screen out crawlers.
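
As a rough illustration of the header fields listed above, the following sketch builds a request with Python's standard `urllib.request` without sending it. The function name `build_request`, the User-Agent string, the URLs, and the cookie value are all hypothetical placeholders, not values from any real site.

```python
from urllib.request import Request

def build_request(url, referer=None, cookie=None):
    """Build (but do not send) a request carrying browser-like headers."""
    headers = {
        # Hypothetical User-Agent value; copy a real one from your own browser.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    }
    if referer:
        headers["Referer"] = referer
    if cookie:
        headers["Cookie"] = cookie
    return Request(url, headers=headers)

req = build_request(
    "https://example.com/list?page=2",
    referer="https://example.com/list?page=1",
    cookie="sessionid=abc123",
)
# urllib stores header names with only the first letter capitalized
# (for example "User-agent"); HTTP header names are case-insensitive.
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access.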

2.2 Response header fields of special concern

An HTTP response has a similar form; a crawler focuses on only one response header field

Set-Cookie (the server asks the user's browser to store a cookie)
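
A browser stores the cookies from Set-Cookie and sends them back in the Cookie request header; a crawler has to do the same by hand or through its HTTP library. The sketch below shows one possible way to do this with the standard `http.cookies` module. The helper name `cookie_header` and the sample header values are assumptions made for illustration.

```python
from http.cookies import SimpleCookie

def cookie_header(set_cookie_values):
    """Turn Set-Cookie response values into one Cookie request header value."""
    jar = SimpleCookie()
    for raw in set_cookie_values:
        # load() parses "name=value; Attr=..." and keeps attributes separately.
        jar.load(raw)
    # Only name=value pairs go back to the server; attributes such as Path do not.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

# Hypothetical Set-Cookie values as a server might send them.
received = [
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark; Max-Age=3600",
]
print(cookie_header(received))
```

This sketch ignores expiry, domain, and path matching, which a real cookie store would have to respect.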

Knowledge: Grasp the request and response headers that the crawler focuses on

3. Common response status codes

200: success
302: temporary redirect. The new URL is given in the Location header of the response
303: in response to a POST, the browser is redirected to a new URL, which it fetches with GET
307: in response to a GET, the browser is redirected to a new URL; the request method is kept unchanged
403: the resource is unavailable. The server understands the client's request but refuses to process it (no permission)
404: the page could not be found
500: internal server error
503: the server cannot respond because of maintenance or overload; the response may carry a Retry-After header. A crawler that hits a URL too frequently may cause the server to ignore its requests and eventually return 503
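
One way a crawler might react to the status codes listed above is sketched below. The function name `next_action`, the action labels, and the 30-second fallback delay are assumptions for illustration; `headers` is assumed to be a plain dict keyed by the exact header names shown.

```python
from http import HTTPStatus

def next_action(status, headers):
    """Suggest what a crawler might do next for a given status code."""
    if status == HTTPStatus.OK:
        return ("parse", None)
    if status in (HTTPStatus.FOUND, HTTPStatus.SEE_OTHER,
                  HTTPStatus.TEMPORARY_REDIRECT):
        # The new URL is carried in the Location response header.
        return ("follow", headers.get("Location"))
    if status == HTTPStatus.SERVICE_UNAVAILABLE:
        # Retry-After may be missing, or may be an HTTP date, which this
        # sketch does not parse; fall back to an assumed 30 seconds.
        delay = headers.get("Retry-After", "")
        return ("retry", int(delay) if delay.isdigit() else 30)
    # 403, 404, 500 and anything else: stop requesting this URL.
    return ("give_up", None)

print(next_action(302, {"Location": "https://example.com/login"}))
print(next_action(503, {"Retry-After": "120"}))
print(next_action(404, {}))
```

Most HTTP libraries follow redirects automatically, so in practice the 503 branch is the one a crawler usually has to handle itself.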

We covered status codes when learning the basics of the Web: they are the server's feedback to the client, and we were taught that they should reflect the real situation. With crawlers, however, a site's developers or operations staff may tamper with status codes to keep crawlers from getting data easily. For example, the server may have recognized that you are a crawler but, to lull you into carelessness, still return status code 200 with no data in the response body.

No status code can be fully trusted; what counts is whether the captured response actually contains the data
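
The point above can be sketched as a check on the response body rather than on the status code. The function name `has_expected_data`, the sample pages, and the marker string are hypothetical; a real crawler would pick a marker that fits the page it targets.

```python
def has_expected_data(body, marker):
    """Judge a response by its content: is the expected marker present?"""
    return marker in body

# Both pages are assumed to have arrived with status code 200.
real_page = '<ul><li class="price">19.99</li></ul>'
decoy_page = "<ul></ul>"  # looks successful, but carries no data

print(has_expected_data(real_page, 'class="price"'))
print(has_expected_data(decoy_page, 'class="price"'))
```

A substring test is the simplest possible check; parsing the page and counting the extracted items is a stricter alternative.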

Knowledge: Understand common response status codes

4. Running process of the browser

After reviewing the HTTP protocol, let's look at how a browser sends HTTP requests

4.1 HTTP Request Process

After obtaining the IP address for the domain name, the browser sends a request to the URL in the address bar and gets a response
The returned content (HTML) contains URLs for CSS, JS, images, and Ajax calls; the browser sends further requests for them in the order they appear and gets the corresponding responses
Each time the browser gets a response, it adds (loads) it into the displayed result. JS, CSS, and other content modify the page, and JS can also send new requests and receive responses
The whole process, from getting the first response and displaying it to getting all the responses and adding to or modifying the displayed result, is called browser rendering
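
The follow-up requests described above can be made visible with a small sketch that lists the URLs a browser would go on to fetch after receiving the first HTML response. The class name `SubresourceCollector`, the sample HTML, and the `example.com` base URL are assumptions for illustration; the sketch only looks at stylesheets, scripts, and images, and does not cover requests issued by JS at run time.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class SubresourceCollector(HTMLParser):
    """Collect URLs a browser would request after receiving the HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Stylesheets use <link href>, scripts and images use src.
        if tag == "link" and attrs.get("rel") == "stylesheet":
            ref = attrs.get("href")
        elif tag in ("script", "img"):
            ref = attrs.get("src")
        else:
            ref = None
        if ref:
            # Resolve relative references against the page URL.
            self.urls.append(urljoin(self.base_url, ref))

html = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
</head><body><img src="logo.png"></body></html>
"""
collector = SubresourceCollector("https://example.com/news/index.html")
collector.feed(html)
for url in collector.urls:
    print(url)
```

A crawler that fetches only the page URL receives the HTML string alone; none of the URLs collected here are requested unless the crawler does so explicitly.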

4.2 Note

A crawler, however, only requests the given URL and gets the corresponding response (which may be HTML, CSS, JS, an image, and so on).

In many cases the page rendered by the browser differs from the page fetched by the crawler, because the crawler cannot render (later in the course we will use other tools or packages to render the response content for the crawler).

The final result displayed by the browser is produced by requests to multiple URLs, whose multiple responses are rendered together
Therefore, a crawler must extract data based on the response for the URL it actually requested

Knowledge point: understand that what a browser displays is rendered from multiple responses to multiple requests, whereas a crawler gets one response for one request

5. Further reading on the HTTP protocol

Blog.csdn.net/qq_33301113…
Blog.csdn.net/qq_30553235…
Segmentfault.com/q/101000000…

I am "white and white I", a programmer who loves sharing knowledge ❤️

If you are new to programming and come across this blog, and there is something you do not understand or you want to learn Python, feel free to leave a comment or message me privately [thank you very much for your likes, favorites, follows, and comments]
