😀 This is the 4th article in my original crawler column

When browsing websites, we often run into pages that can only be accessed after logging in. Once logged in, we can visit the site many times in a row, but sometimes we have to log in again after a while. Other websites log us in automatically as soon as the browser opens and stay valid for a long time. Why is this? The answer involves Session and Cookie, and this section explains how they work.

1. Static and dynamic web pages

Before we begin, we need to understand the concepts of static and dynamic web pages. Here is the same sample code as before:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
    <title>This is a Demo</title>
  </head>
  <body>
    <div id="container">
      <div class="wrapper">
        <h2 class="title">Hello World</h2>
        <p class="text">Hello, this is a paragraph.</p>
      </div>
    </div>
  </body>
</html>

This is the most basic HTML code. We save it as a test.html file and put it on a host with a fixed public IP that has Apache or Nginx installed, so that the host can act as a server. Other people can then visit the server and see the page, which makes for a very simple website.

The content of this web page is written entirely in HTML code: the text, images, and other content are all specified by the HTML itself. Such a page is called a static page. It loads quickly and is easy to write, but it has serious drawbacks, such as poor maintainability and the inability to display content flexibly based on the URL. For example, if we pass a name parameter in the URL of this page, there is no way for the page to display it.

Dynamic web pages emerged to address this. They can parse changes in URL parameters, query a database, and present different page content dynamically, which makes them very flexible. Most of the sites we encounter today are dynamic sites. They are no longer plain HTML but may be written in JSP, PHP, Python, and so on, and are far more powerful and feature-rich than static pages. Dynamic websites can also implement user login and registration.
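To make the difference concrete, here is a minimal Python sketch of what "dynamic" means: the returned HTML depends on the name parameter in the URL. The function name render_page and the example.com URLs are assumptions made for illustration; a real site would do this inside a web framework.

```python
from html import escape
from urllib.parse import parse_qs, urlparse


def render_page(url: str) -> str:
    """Return HTML whose content depends on the `name` query parameter."""
    query = parse_qs(urlparse(url).query)
    # Fall back to a default when the parameter is missing.
    name = query.get("name", ["World"])[0]
    # Escape the value so user input cannot inject markup.
    return f'<h2 class="title">Hello {escape(name)}</h2>'


print(render_page("http://example.com/test?name=Alice"))
print(render_page("http://example.com/test"))
```

A static test.html file would return the same "Hello World" heading for both URLs.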

Going back to the question at the beginning, many pages require login to view. By general logic, after logging in with a username and password, we must be given something like a credential that allows us to stay logged in and access pages that we can only see after logging in.

So what exactly is this mysterious credential? In fact, it is the result of Session and Cookie working together. Let’s take a look.

2. Stateless HTTP

Before we get to sessions and cookies, we need to understand a feature of HTTP called statelessness.

HTTP statelessness means that the HTTP protocol has no memory of earlier transactions, that is, the server does not know the state of the client. When we send a request to the server, the server parses the request and returns the corresponding response. Each such exchange is completely independent, and the server does not record any state changes between one request and the next. This means that if later processing needs earlier information, that information must be retransmitted, so some earlier requests would have to be repeated just to get the subsequent response, which is obviously not what we want. To maintain state across requests, we certainly can’t retransmit all previous requests every time; that would waste resources, especially for pages that require users to log in.

This is where two techniques for maintaining state across HTTP requests come in: Session and Cookie. The Session lives on the server side, that is, the website’s server, and stores the user’s session information. The Cookie lives on the client side, which can be understood as the browser side. Once the browser has a Cookie, it automatically attaches it to the next request it sends to the website. The server identifies the user from the Cookie, determines whether the user is logged in, and returns the corresponding response.

We can think of the Cookie as holding the login credential. With it, we only need to send the Cookie along with the next request, without re-entering the username, password, and other information to log in again.

Therefore, when a crawler has to deal with pages that require login, we usually put the Cookies obtained after a successful login directly into the request headers, rather than simulating the login again.
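As a rough sketch of this idea using only Python’s standard library, the snippet below builds a request that carries Cookies in its Cookie header. The cookie names and values, the URL, and the helper build_request are all hypothetical; in practice the values would be copied from the browser after logging in.

```python
from urllib.request import Request

# Hypothetical cookie values copied from the browser after a successful login.
cookies = {"sessionid": "abc123", "theme": "dark"}


def build_request(url: str, cookies: dict) -> Request:
    """Create a request that carries the given cookies in its Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header})


request = build_request("https://example.com/profile", cookies)
print(request.get_header("Cookie"))
```

Nothing is sent over the network here. Passing the request to urllib.request.urlopen would actually send it with the Cookie header attached.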

Now that we understand the concepts of Session and Cookie, let’s take a closer look at how they work.

3. Session

A Session literally means a series of actions or messages with a beginning and an end. For example, when making a phone call, the sequence of events from picking up the phone and dialing to hanging up can be called a Session.

On the Web, Session objects are used to store attributes and configuration information needed for a particular user’s Session. This way, variables stored in the Session object will not be lost when the user jumps between Web pages of the application, but will persist throughout the user’s Session. When a user requests a Web page from an application, if the user does not already have a Session, the Web server automatically creates a Session object. When a Session expires or is abandoned, the server terminates the Session.
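The behavior described above can be sketched as a small in-memory store. This is only an illustration under simplifying assumptions: the class name SessionStore and its methods are made up here, and real web servers and frameworks manage Sessions with their own, more elaborate mechanisms.

```python
import secrets


class SessionStore:
    """In-memory mapping from Session ID to that user's Session data."""

    def __init__(self):
        self._sessions = {}

    def get_or_create(self, session_id=None):
        """Return (session_id, data), creating a new Session when needed."""
        if session_id not in self._sessions:
            # Unknown or missing ID: issue a fresh, hard-to-guess one.
            session_id = secrets.token_hex(16)
            self._sessions[session_id] = {}
        return session_id, self._sessions[session_id]

    def destroy(self, session_id):
        """Terminate a Session, for example when the user logs out."""
        self._sessions.pop(session_id, None)


store = SessionStore()
sid, data = store.get_or_create()
data["username"] = "alice"            # stored while handling the first page
_, same_data = store.get_or_create(sid)
print(same_data["username"])          # still available on the next page
```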

4. Cookie

Cookie, also commonly used in its plural form Cookies, refers to data stored on a user’s local terminal by some websites for identification and Session tracking.

Session maintenance

So how do we use Cookies to maintain state? When the client requests the server for the first time, the server returns a response whose headers include a Set-Cookie field, which is used to mark the user. The client browser saves the Cookies. The next time the browser requests this website, it puts the Cookies in the request headers and submits them to the server. The Cookies carry the Session ID, so the server can check the Cookies to find the corresponding Session and then use the Session to identify the user’s status.

Upon successful login to a website, the server tells the client which Cookies to set. On subsequent visits, the client sends the Cookies to the server, which finds the corresponding Session and checks it. If the Session variables that mark the login state are valid, the user is considered logged in, and the server returns the page content that is only visible after login, which the browser then renders.

Conversely, if the Cookies sent to the server are invalid or the Session has expired, we will not be able to continue accessing the page; we may receive an error response or be redirected to the login page to log in again.

Therefore, Cookies and Session need to work together, one on the client side and the other on the server side, to implement login session control.
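The whole round trip can be simulated in a few lines of Python. This is a simplified sketch, not how any particular website implements it: the cookie name sessionid, the SESSIONS dictionary, and the functions handle_login and handle_page are assumptions made for illustration, and no real HTTP traffic is involved.

```python
import secrets
from http.cookies import SimpleCookie

SESSIONS = {}  # server side: Session ID -> Session data


def handle_login(username):
    """Server: create a Session and return the Set-Cookie header value."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"username": username, "logged_in": True}
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    return cookie["sessionid"].OutputString()


def handle_page(cookie_header):
    """Server: look up the Session named by the request's Cookie header."""
    cookie = SimpleCookie(cookie_header)
    morsel = cookie.get("sessionid")
    session = SESSIONS.get(morsel.value) if morsel else None
    if session and session.get("logged_in"):
        return f"Welcome back, {session['username']}"
    return "Please log in"


# Client: keep whatever the server set and send it back on the next request.
set_cookie = handle_login("alice")
client_cookie = set_cookie.split(";")[0]
print(handle_page(client_cookie))        # valid Session ID: logged-in content
print(handle_page("sessionid=invalid"))  # unknown Session ID: back to login
print(handle_page(""))                   # no Cookie at all: back to login
```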

Attribute structure

Next, let’s take a look at what Cookies contain. Take Zhihu as an example. Open the Application tab in the browser developer tools; on the left there is a Storage section whose last item is Cookies. Click on it, as shown in the figure.

Figure: Cookies list

As you can see, there are many entries, each of which can be called a Cookie. It has the following properties.

  • Name: the name of the Cookie. Once a Cookie is created, its name cannot be changed.
  • Value: the value of the Cookie. If the value contains Unicode characters, it needs to be character-encoded. If the value is binary data, Base64 encoding is required.
  • Domain: the domain name that can access the Cookie. For example, if it is set to .zhihu.com, all domain names ending in zhihu.com can access the Cookie.
  • Path: the path under which the Cookie is used. If set to /path/, only pages whose path begins with /path/ can access the Cookie. If set to /, all pages under the domain name can access the Cookie.
  • Max-Age: the time until the Cookie expires, in seconds. It is often used together with Expires to determine the expiration time. If Max-Age is positive, the Cookie expires after Max-Age seconds. If neither Max-Age nor Expires is set, the Cookie becomes invalid when the browser is closed, and the browser does not save it in any persistent form. (Some server-side APIs express this case with a negative Max-Age; in the HTTP header itself, a zero or negative Max-Age makes the Cookie expire immediately.)
  • Size: the size of the Cookie.
  • HTTP: the Cookie’s httponly property. If this property is true, the Cookie is carried only in HTTP headers and cannot be accessed through document.cookie.
  • Secure: whether the Cookie is transmitted only over a secure protocol. Secure protocols such as HTTPS and SSL encrypt data before sending it over the network. The default value is false.
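To see how these properties appear on the wire, the sketch below uses Python’s standard http.cookies module to build the value of a Set-Cookie header. The cookie name, its value, and the domain .example.com are made-up examples.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["sessionid"] = "abc123"        # Name and Value
morsel = cookie["sessionid"]
morsel["domain"] = ".example.com"     # Domain
morsel["path"] = "/"                  # Path
morsel["max-age"] = 3600              # Max-Age, in seconds
morsel["httponly"] = True             # hidden from document.cookie
morsel["secure"] = True               # only sent over HTTPS

set_cookie_value = morsel.OutputString()
print("Set-Cookie:", set_cookie_value)
```

The Size column shown in the developer tools is computed by the browser and is not something the server sets.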

Session cookies and persistent cookies

On the surface, a session Cookie is kept in browser memory and becomes invalid once the browser closes, while a persistent Cookie is saved to the client’s hard disk and can be reused next time, keeping the user logged in for a long time.

In fact, strictly speaking, there is no separate session Cookie or persistent Cookie; the Cookie’s Max-Age or Expires field simply determines when it expires.

Therefore, sites that keep users logged in persistently simply set long validity periods for both the Cookie and the Session. The next time we visit the page, we still carry the previous Cookie, so the login state is maintained directly.
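A crawler can apply the same rule when inspecting a Set-Cookie header: a Cookie with an explicit Max-Age or Expires is stored with an expiry time, and one without either lasts only for the browser session. The helper has_explicit_expiry below is a hypothetical name, and the header values are made up.

```python
from http.cookies import SimpleCookie


def has_explicit_expiry(set_cookie_value: str) -> bool:
    """True if the cookie carries a Max-Age or Expires attribute."""
    cookie = SimpleCookie(set_cookie_value)
    morsel = next(iter(cookie.values()))
    return bool(morsel["max-age"] or morsel["expires"])


# No expiry attribute: discarded when the browser closes.
print(has_explicit_expiry("sessionid=abc123; Path=/"))
# Thirty days in seconds: kept across browser restarts.
print(has_explicit_expiry("sessionid=abc123; Path=/; Max-Age=2592000"))
```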

5. Common mistakes

When talking about the Session mechanism, a common misconception is that “once the browser is closed, the Session is gone”. Consider a membership card as an analogy: unless the customer asks to cancel the card, the store will not casually delete the customer’s information. The same is true for Sessions, which persist until the application tells the server to remove them. For example, programs typically delete a Session only when we log out.

But when we close the browser, the browser does not notify the server beforehand, so the server has no way of knowing that the browser has been closed. The misconception arises because most websites use session Cookies to store the Session ID. When the browser is closed, those Cookies disappear, and on the next connection to the server the original Session can no longer be found. If the Cookies set by the server are saved to the hard disk, or the HTTP request headers sent by the browser are rewritten by some means to send the original Cookies to the server, the original Session ID can still be found when the browser is opened again, and the login state is preserved.
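A crawler can exploit this in the same way: save the Cookies to disk before exiting and load them on the next run. The sketch below stores them as JSON in a temporary directory; the file name cookies.json, the cookie value, and both helper functions are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_cookies(cookies: dict, path: str) -> None:
    """Write the cookie names and values to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f)


def load_cookie_header(path: str) -> str:
    """Read the saved cookies back and format them as a Cookie header value."""
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


path = os.path.join(tempfile.mkdtemp(), "cookies.json")
save_cookies({"sessionid": "abc123"}, path)   # before the program exits
print(load_cookie_header(path))               # on the next run
```

As long as the server has not deleted the Session, sending this header again finds the original Session ID.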

And since closing the browser does not cause the Session to be deleted, the server needs to set an expiration time for the Session. If the time since the client last used the Session exceeds this expiration time, the server can assume that the client has stopped being active and delete the Session to save storage space.
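The expiry rule can be sketched as follows. The class ExpiringSessions, the 1800-second timeout, and the Session IDs are illustrative assumptions; timestamps are passed in explicitly so that the example is deterministic.

```python
import time


class ExpiringSessions:
    """Sessions that are deleted after a period of inactivity."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self._last_used = {}   # Session ID -> timestamp of last access
        self._data = {}        # Session ID -> Session data

    def touch(self, session_id, now=None):
        """Record activity for a Session, creating it if necessary."""
        now = time.time() if now is None else now
        self._last_used[session_id] = now
        return self._data.setdefault(session_id, {})

    def purge(self, now=None):
        """Delete Sessions idle for longer than the timeout; return their IDs."""
        now = time.time() if now is None else now
        expired = [sid for sid, last in self._last_used.items()
                   if now - last > self.timeout]
        for sid in expired:
            del self._last_used[sid]
            del self._data[sid]
        return expired


sessions = ExpiringSessions(timeout_seconds=1800)
sessions.touch("abc123", now=0)
sessions.touch("def456", now=1000)
print(sessions.purge(now=2000))   # only abc123 has been idle for more than 1800 s
```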

6. Summary

This section introduced the basic concepts of Session and Cookie. They are very helpful for the web crawler development covered later and should be well understood.

Because some technical terms are involved, part of this section draws on the following sources:

  • Documentation – Session – Baidu Encyclopedia: baike.baidu.com/item/sessio…
  • Documentation – Cookie – Baidu Encyclopedia: baike.baidu.com/item/cookie…
  • Documentation – HTTP cookies – Wikipedia: en.wikipedia.org/wiki/HTTP_c…
  • Blog – Understanding Session and several state retention schemes: www.mamicode.com/info-detail…

Thank you very much for reading. For more content, please follow my official accounts “Attack Coder” and “Cui Qingcai | Jingmi”.