In this section, we’ll take a closer look at the basics of HTTP and see what happens between typing a URL into the browser and retrieving the web content. Understanding this process will help us grasp the basic principles of crawling.
1. The URI and URL
Here we first learn about URIs and URLs. URI stands for Uniform Resource Identifier, and URL stands for Uniform Resource Locator.
For example, github.com/favicon.ico is the link to GitHub’s icon, and it is both a URL and a URI. There is an icon resource, and we uniquely specify how to access it with a URL/URI: the access protocol (HTTPS), the access path (/, the root directory), and the resource name (favicon.ico). Through this link, the resource can be located on the Internet; such a link is called a URL/URI.
A URL is a subset of a URI: every URL is a URI, but not every URI is a URL. So what kind of URI is not a URL? URIs also include a subclass called URN, which stands for Uniform Resource Name. A URN names a resource without specifying how to locate it. For example, urn:isbn:0451450523 specifies a book’s ISBN, which uniquely identifies the book, but does not tell us where to find a copy. Figure 2-1 shows the relationship between URLs, URNs, and URIs.
Figure 2-1 Relationship between URLs, URNs, and URIs
But URNs are rarely used on the Internet today, so almost all URIs are URLs. Ordinary web links can be called either URLs or URIs; I prefer to call them URLs.
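As a quick illustration, the components of such a link can be pulled apart with Python’s standard `urllib.parse` module (a minimal sketch using the GitHub icon URL from above):

```python
from urllib.parse import urlparse

# Split the GitHub icon link into the parts described above
result = urlparse("https://github.com/favicon.ico")

print(result.scheme)  # access protocol: https
print(result.netloc)  # host: github.com
print(result.path)    # resource path: /favicon.ico
```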
2. Hypertext
Next, let’s look at another concept: hypertext. What we see in the browser is the result of parsing hypertext. The source code of a web page is a series of HTML code containing a variety of tags, such as img for displaying an image and p for specifying a paragraph. After the browser parses these tags, it renders the page we see every day, and this HTML source code is what we call hypertext.
For example, open any page in Chrome, such as the Taobao home page, right-click anywhere, and select “Inspect” (or press F12) to open the browser’s developer tools. In the Elements tab you can see the current page’s source code, and this source code is hypertext, as shown in Figure 2-2.
Figure 2-2 Source code
3. HTTP and HTTPS
Taobao’s home page URL, www.taobao.com/, begins with http or https, which indicates the protocol type used to access the resource. Sometimes we also see URLs that start with ftp, sftp, or smb, which are likewise protocol types. In crawling, the pages we grab are usually served over HTTP or HTTPS, so let’s first look at what these two protocols mean.
HTTP stands for Hyper Text Transfer Protocol. It is used to transfer hypertext data from the network to the local browser, and it ensures that hypertext documents are transmitted efficiently and accurately. HTTP is a specification jointly developed by the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF). Currently, HTTP 1.1 is the most widely used version.
HTTPS stands for Hyper Text Transfer Protocol over Secure Socket Layer. It is an HTTP channel with security as its goal; in short, it is the secure version of HTTP, that is, HTTP with an SSL layer added.
The security of HTTPS is based on SSL. Therefore, the content transmitted through HTTPS is encrypted by SSL. The functions of HTTPS can be divided into two aspects.
- Establish an information security channel to ensure the security of data transmission.
- Verify the authenticity of the website. For an HTTPS website, you can click the lock icon in the browser’s address bar to view the website’s verified identity information, or check the security seal issued by the CA.
Now more and more websites and apps are moving toward HTTPS, for example:
- Apple required that from January 1, 2017, all iOS apps use HTTPS encryption, or they would not be allowed on the App Store.
- Starting with Chrome 56, released in January 2017, Google flags links that are not encrypted with HTTPS with a risk alert, prominently reminding users that “this page is not secure”.
- The official requirements for Tencent’s WeChat mini programs state that the backend must use HTTPS for network communication; domain names and protocols that do not meet the requirement cannot be requested.
For example, if you open Chrome and enter www.12306.cn/, the message “Your connection is not private” may be displayed, as shown in Figure 2-3.
Figure 2-3 12306 page
This is because the CA certificate of 12306 was issued by the Chinese Ministry of Railways, and this certificate is not in the browsers’ list of trusted CA certificates, so certificate verification fails; nevertheless, its data transmission is still encrypted via SSL. If you want to crawl such a site, you need to set an option to ignore the certificate, otherwise you will get an SSL certificate error.
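In Python, for instance, the standard `ssl` module can build a context that skips certificate verification, which is the kind of “ignore the certificate” option mentioned above (a sketch; note that disabling verification also disables the authenticity check that HTTPS normally provides):

```python
import ssl

# Build an SSL context that skips certificate verification
context = ssl.create_default_context()
context.check_hostname = False       # must be disabled before verify_mode
context.verify_mode = ssl.CERT_NONE  # do not verify the certificate chain

# Such a context could then be passed to an opener, e.g.:
# urllib.request.urlopen("https://www.12306.cn/", context=context)
```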
4. HTTP request process
We type a URL into the browser, press Enter, and see the page content. What actually happens is that the browser sends a request to the server hosting the website; the server receives the request, processes and parses it, and returns a response to the browser. The response contains the page’s source code and other content, which the browser parses to render the web page, as shown in Figure 2-4.
Figure 2-4 Model diagram
In this case, the client represents our own PC or mobile browser, and the server is the server where the website to be accessed resides.
To illustrate this process more concretely, here is a demonstration using the Network panel in Chrome’s developer tools, which displays all the network requests and responses that occur while loading the current page.
Open Chrome, right-click the page, and choose Inspect to open the developer tools. Then visit Baidu: type www.baidu.com/ into the address bar and press Enter, and observe what network requests occur. Entries appear one by one in the Network panel; each entry represents one round of sending a request and receiving a response, as shown in Figure 2-5.
Figure 2-5 Network panel
Let’s look at the first network request, www.baidu.com.
The meaning of each column is as follows.
- The first column, Name: the name of the request, usually the last part of the URL.
- The second column, Status: the status code of the response, displayed here as 200, which indicates a normal response. From the status code we can determine whether the request received a normal response after it was sent.
- The third column, Type: the type of the requested document. Here it is document, meaning what was requested this time is an HTML document.
- The fourth column, Initiator: the request source, marking which object or process initiated the request.
- The fifth column, Size: the size of the file or resource downloaded from the server. If the resource was fetched from the cache, this column displays “from cache”.
- The sixth column, Time: the total time from initiating the request to receiving the response.
- The seventh column, Waterfall: a waterfall visualization of the request’s network timing.
Click on the entry to see more details, as shown in Figure 2-6.
Figure 2-6 Details
First, the General section:

- Request URL: the URL of the request.
- Request Method: the request method.
- Status Code: the response status code.
- Remote Address: the address and port of the remote server.
- Referrer Policy: the Referrer policy.
Further down, you can see Response Headers and Request Headers, which represent the response headers and request headers, respectively. The request headers carry much information about the request, such as the browser identification, Cookies, and Host, and are part of the request; the server uses this information to judge whether the request is legitimate and to respond accordingly. The Response Headers shown in the figure are part of the response and contain information such as the server type, document type, and date; after receiving the response, the browser parses its content to render the web page.
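The whole cycle described in this section can be reproduced locally with Python’s standard library: the sketch below starts a throwaway HTTP server on an ephemeral port, sends it one request, and inspects the status code, a response header, and the response body (the handler and page content are invented for illustration):

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Respond with a tiny HTML document
        page = b"<html><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, *args):
        pass  # silence per-request logging

# Serve on an ephemeral port in a background thread
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "browser" side: send a request and read the response
resp = urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/")
body = resp.read()
print(resp.status)                   # 200
print(resp.headers["Content-Type"])  # text/html
print(body)                          # the response body (HTML source)
server.shutdown()
```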
Let’s take a look at what’s involved in the request and response.
5. Request
A Request is sent by the client to the server and consists of four parts: Request Method, Request URL, Request Headers, and Request Body.
(1) Request method
There are two common request methods: GET and POST.
Typing a URL into the browser and pressing Enter initiates a GET request, whose parameters are included directly in the URL. For example, searching for Python on Baidu is a GET request with the link www.baidu.com/s?wd=Python, where the URL contains the request’s parameters and wd is the keyword to search for. POST requests are mostly initiated when a form is submitted. For example, in a login form, clicking the “Login” button after entering a username and password usually initiates a POST request, whose data is transferred in the form rather than reflected in the URL.
The differences between the GET and POST request methods are as follows.

- The parameters of a GET request are contained in the URL, where the data is visible; the URL of a POST request does not carry the data, which is instead transmitted through the form and contained in the request body.
- The data a GET request can carry is limited by the maximum URL length (which depends on the browser and server, typically a few thousand characters), while a POST request has no such limit.

Generally speaking, logging in requires submitting a username and password, which contain sensitive information. If the GET method were used, the password would be exposed in the URL, risking a leak, so it is best to send it via POST. Uploading a large file is also done with POST.
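The difference can be made concrete by constructing both kinds of request with `urllib` from the standard library (using the Baidu search example above; nothing is actually sent over the network):

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({"wd": "Python"})

# GET: the parameters ride along in the URL itself
get_req = Request("https://www.baidu.com/s?" + params, method="GET")
print(get_req.full_url)  # https://www.baidu.com/s?wd=Python

# POST: the same data goes into the request body, not the URL
post_req = Request("https://www.baidu.com/s",
                   data=params.encode("utf-8"), method="POST")
print(post_req.full_url)  # https://www.baidu.com/s  (no parameters)
print(post_req.data)      # b'wd=Python'
```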
Most of the requests we encounter are GET or POST requests. There are also other request methods, such as HEAD, PUT, DELETE, OPTIONS, CONNECT, and TRACE, which are briefly summarized in Table 2-1.
Table 2-1 Other request methods
| Method | Description |
| --- | --- |
| GET | Requests a page and returns the page content |
| HEAD | Similar to GET, except that the response has no concrete body; used to retrieve the headers |
| POST | Mostly used to submit a form or upload a file; the data is contained in the request body |
| PUT | Replaces the contents of the specified document with data sent from the client to the server |
| DELETE | Asks the server to delete the specified page |
| CONNECT | Uses the server as a relay to access other web pages on behalf of the client |
| OPTIONS | Allows the client to view the server’s capabilities |
| TRACE | Echoes back the request received by the server, for testing or diagnosis |
Table reference: www.runoob.com/http/http-m… .
(2) Request URL

The request URL, i.e., the Uniform Resource Locator, uniquely identifies the resource we want to request.
(3) Request headers

The request headers describe additional information for the server to use, such as Cookie, Referer, and User-Agent. Some commonly used headers are briefly described below.
- Accept: a request header field that specifies what content types the client can accept.
- Accept-Language: specifies the languages acceptable to the client.
- Accept-Encoding: specifies the content encodings acceptable to the client.
- Host: specifies the host name and port number of the server hosting the requested resource, i.e., the location of the original server or gateway for the requested URL. Since HTTP 1.1, every request must include this header.
- Cookie (often used in the plural, Cookies): data stored locally by a website to identify the user and track the session. Its main function is to maintain the current session. For example, after we log in to a website with a username and password, the server uses a session to record the login state; later, when we refresh the page or request other pages of the site, we find we are still logged in, and that is thanks to Cookies. The Cookies carry the information that identifies our session on the server, so every time the browser requests a page from this site it attaches the Cookies to the request headers; the server uses them to identify us, determines that we are in the logged-in state, and therefore returns content that can only be seen after logging in.
- Referer: identifies the page from which the request was sent. The server can use this information for purposes such as source statistics and hotlink protection.
- User-Agent (UA for short): a special string header that lets the server identify the client’s operating system and version, browser and version, and so on. Adding this header to a crawler can disguise it as a browser; without it, the request is likely to be recognized as coming from a crawler.
- Content-Type: also called the Internet media type or MIME type. In an HTTP message, it indicates the media type of the content. For example, text/html means HTML format, image/gif a GIF image, and application/json the JSON type. You can check the mapping table at tool.oschina.net/commons for more information.
Therefore, the request headers are an important part of the request, and in most cases they need to be set when writing a crawler.
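As an illustration, headers like those above can be attached to a request with `urllib` (the header values are examples; no request is sent here). Note that `urllib` normalizes stored header names, so `User-Agent` is looked up as `User-agent`:

```python
from urllib.request import Request

# Attach common request headers; the values below are examples
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html",
    "Accept-Language": "en-US,en;q=0.9",
}
req = Request("https://www.example.com/", headers=headers)

print(req.get_header("User-agent"))  # the UA string set above
```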
(4) Request body

The request body typically carries the form data of a POST request; for a GET request, the request body is empty.
For example, the request and response captured when I log into GitHub is shown in Figure 2-7.
Figure 2-7 Details
Before logging in, we fill in the username and password, which are submitted to the server as form data. Note that the Content-Type specified in the Request Headers is application/x-www-form-urlencoded; the data is submitted as a form only when the Content-Type is set to application/x-www-form-urlencoded. Alternatively, we can set the Content-Type to application/json to submit JSON data, or to multipart/form-data to upload files.
Table 2-2 lists the relationship between the Content-Type and POST data submission methods.
Table 2-2 Relationship between Content-Type and POST data submission methods
| Content-Type | Data submission method |
| --- | --- |
| application/x-www-form-urlencoded | Form data |
| multipart/form-data | Form file upload |
| application/json | Serialized JSON data |
| text/xml | XML data |
When constructing a POST request in a crawler, you need to use the correct Content-Type and understand which Content-Type each parameter setting of the various request libraries produces; otherwise the POST submission may not receive a normal response.
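A sketch of the pairing between Content-Type and body encoding, using `urllib` request objects (the login URL and credentials are invented; nothing is sent):

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

payload = {"username": "alice", "password": "secret"}  # example data

# Form submission: urlencoded body with the matching Content-Type
form_req = Request(
    "https://www.example.com/login",
    data=urlencode(payload).encode("utf-8"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# JSON submission: serialized JSON body with the matching Content-Type
json_req = Request(
    "https://www.example.com/login",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

print(form_req.data)  # b'username=alice&password=secret'
print(json_req.data)  # b'{"username": "alice", "password": "secret"}'
```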
6. Response
A Response is returned by the server to the client and can be divided into Response Status Code, Response Headers, and Response Body.
(1) Response status code
The response status code indicates the server’s response status. For example, 200 indicates that the server responded normally, 404 that the page was not found, and 500 that an internal server error occurred. In a crawler, we can judge the server’s response from the status code: if it is 200, the data was returned successfully and we proceed with further processing; otherwise, the response is ignored. Table 2-3 lists common status codes and their meanings.

Table 2-3 Common status codes and their meanings
| Status code | Meaning | Details |
| --- | --- | --- |
| 100 | Continue | The requester should continue the request. The server has received part of the request and is waiting for the rest |
| 101 | Switching Protocols | The requester has asked the server to switch protocols, and the server has confirmed and is ready to switch |
| 200 | OK | The server has successfully processed the request |
| 201 | Created | The request succeeded and the server created a new resource |
| 202 | Accepted | The server has accepted the request but has not yet processed it |
| 203 | Non-Authoritative Information | The server successfully processed the request, but the returned information may come from another source |
| 204 | No Content | The server successfully processed the request but returned no content |
| 205 | Reset Content | The server successfully processed the request, and the content should be reset |
| 206 | Partial Content | The server successfully processed part of the request |
| 300 | Multiple Choices | The server can perform a variety of actions for the request |
| 301 | Moved Permanently | The requested page has been permanently moved to a new location: a permanent redirect |
| 302 | Found | The requested page is temporarily redirected to another page |
| 303 | See Other | If the original request was a POST, the redirect target should be retrieved with GET |
| 304 | Not Modified | The page has not been modified since the last request; the cached resource continues to be used |
| 305 | Use Proxy | The requester should access the page through a proxy |
| 307 | Temporary Redirect | The requested resource temporarily responds from a different location |
| 400 | Bad Request | The server could not parse the request |
| 401 | Unauthorized | The request was not authenticated or failed authentication |
| 403 | Forbidden | The server refused the request |
| 404 | Not Found | The server could not find the requested page |
| 405 | Method Not Allowed | The server has disabled the method specified in the request |
| 406 | Not Acceptable | The requested page cannot respond with the requested content characteristics |
| 407 | Proxy Authentication Required | The requester needs to authenticate with the proxy |
| 408 | Request Timeout | The server timed out waiting for the request |
| 409 | Conflict | The server encountered a conflict while completing the request |
| 410 | Gone | The requested resource has been permanently removed |
| 411 | Length Required | The server will not accept a request without a valid Content-Length header field |
| 412 | Precondition Failed | The server did not meet a precondition the requester set in the request |
| 413 | Request Entity Too Large | The request entity is too large for the server to handle |
| 414 | Request-URI Too Long | The requested URL is too long for the server to process |
| 415 | Unsupported Media Type | The request format is not supported by the requested page |
| 416 | Requested Range Not Satisfiable | The page cannot provide the requested range |
| 417 | Expectation Failed | The server did not meet the requirement of the Expect request header field |
| 500 | Internal Server Error | The server encountered an error and could not complete the request |
| 501 | Not Implemented | The server lacks the capability to complete the request |
| 502 | Bad Gateway | The server, acting as a gateway or proxy, received an invalid response from the upstream server |
| 503 | Service Unavailable | The server is currently unavailable |
| 504 | Gateway Timeout | The server, acting as a gateway or proxy, did not receive a response from the upstream server in time |
| 505 | HTTP Version Not Supported | The server does not support the HTTP version used in the request |
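Python’s standard library already knows these codes and their reason phrases, which makes the “only parse on 200” pattern easy to sketch (`should_parse` is a made-up helper name for illustration):

```python
from http import HTTPStatus

# Standard status codes and reason phrases live in the stdlib
print(HTTPStatus.OK.value, HTTPStatus.OK.phrase)                # 200 OK
print(HTTPStatus.NOT_FOUND.value, HTTPStatus.NOT_FOUND.phrase)  # 404 Not Found

def should_parse(status_code):
    # A crawler typically processes the body only on a normal response
    return status_code == HTTPStatus.OK

print(should_parse(200))  # True
print(should_parse(404))  # False
```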
(2) Response headers

The response headers contain the server’s reply information for the request, such as Content-Type, Server, and Set-Cookie. Some commonly used headers are briefly described below.
- Date: the time when the response was generated.
- Last-Modified: the time when the resource was last modified.
- Content-Encoding: the encoding of the response content.
- Server: information about the server, such as its name and version number.
- Content-Type: the type of the returned data. For example, text/html means an HTML document is returned, application/x-javascript a JavaScript file, and image/jpeg an image.
- Set-Cookie: sets Cookies. This header tells the browser to store its content in Cookies, and the next request to the site will carry those Cookies.
- Expires: specifies an expiration time for the response, allowing the proxy server or browser to cache the loaded content. When the content is accessed again within that time, it can be loaded directly from the cache, which reduces the server’s load and shortens load times.
(3) Response body

This is the most important part: the body data of the response lives here. For example, when requesting a web page, the response body is the page’s HTML code; when requesting an image, the response body is the image’s binary data. After a crawler requests a page, the content to parse is the response body, as shown in Figure 2-8.
Figure 2-8 Response body content
Click Preview in the browser’s developer tools and you will see the web page’s source code, i.e., the content of the response body, which is the target of our parsing.
When writing crawlers, we mainly obtain the page’s source code or JSON data from the response body, and then extract the needed content from it.
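The two cases above (HTML text vs. image binary) come down to how the body bytes are handled; a minimal sketch with made-up sample data:

```python
# A response body arrives as raw bytes. For an HTML page, it is decoded
# into text (using the charset from Content-Type; UTF-8 assumed here).
raw_body = b"<html><body>\xe4\xbd\xa0\xe5\xa5\xbd</body></html>"  # sample bytes

html = raw_body.decode("utf-8")  # text to hand to a parser
print(html)  # <html><body>你好</body></html>

# An image's body stays binary and would be written to a file as-is
image_body = b"\x89PNG\r\n\x1a\n"  # the magic bytes that start a PNG file
print(image_body[:4])
```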
In this section, we covered the basics of HTTP and the rough request-and-response process behind accessing a web page. This material comes up often when analyzing web requests, so it is worth understanding well.