If you have ever interviewed for a job, especially a crawler (web scraping) position, you have probably been asked: do you know what happens behind the scenes after you type Baidu's address into the browser? The question is really a test of how well you understand computer network protocols. Today, Little Handsome B will talk it through with you. Before that, let's first understand what a URL is.

Do you really understand what a URL is?

We often use a browser to surf the Internet. When we want to visit a website, we enter a string in the browser's address bar, which is commonly called the “web address”. This web address is a URL, which is short for Uniform Resource Locator.

For example, if we want to visit Google, we should type in the browser’s address bar:

http://www.google.com

Here we can see the two main parts of a URL: the protocol and the resource name. They are separated by “://”. On the left, http is the protocol; on the right, www.google.com is the resource name.
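As a small sketch of that split, here is how Python's standard urllib.parse module sees the same address. Python is my choice for the examples in this article; nothing above depends on it.

```python
from urllib.parse import urlsplit

# Split the example URL into the two parts described above.
parts = urlsplit("http://www.google.com")

print(parts.scheme)  # the protocol, left of "://"   -> http
print(parts.netloc)  # the resource name, right of it -> www.google.com
```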

The protocol used here is HTTP, whose full name is Hypertext Transfer Protocol. There are many other network protocols; familiar examples include FTP and HTTPS.

Why use protocols?

As the saying goes, nothing can be accomplished without rules. When data is transmitted, both sides have to follow agreed rules before the requested resource can be obtained. Suppose you are at home, suddenly hungry, and want takeout. You have to choose what you want and pay for it, the restaurant has to prepare it, and only then does the delivery guy happily bring it to you. If you don't follow the rules and don't pay, do you still get to eat? Will the delivery guy smile and bring it to you? Unless he's handsome.

So the HTTP protocol we are using here is the set of rules for fetching hypertext documents.

The name of the resource

A resource name is a complete address. The format of the resource name depends on the protocol, but in most protocols the resource name includes:

1. Host name: the host name is the name of the server, usually a domain name, which corresponds to the IP address of the server. For example, in http://www.google.com the host name is www.google.com.

2. File name: the path of the file we want to access on the server. For example, to access the photo in the teacher directory on server A, we can use:

http://www.a.com/teacher/photo.jpg

Here teacher/photo.jpg is the file name.

3. Port number: the port to connect to on the server. We usually don't need to type it, because HTTP uses port 80 by default. A server has ports numbered 0 to 65535, and you can only reach a service through the port the server has opened for it. It's like checking in at a hotel with 65536 rooms: the front desk tells you that room 8000 is available, so you pay, take the key card and go to room 8000. You don't wander through all 65536 rooms!

The port number is written after the host name, separated by a colon. For example: http://www.google.com:80

4. Parameters: parameters select a specific resource. They are usually appended to the address as key-value pairs, and the server uses their values to decide what to return. For example, to get the first through tenth boduoyejieyi photos in the teacher directory of a website, we can use:

http://www.a.com/teacher/pic/boduoyejieyi?start=1&end=10
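The four parts of the resource name can be pulled out of a URL with Python's standard urllib.parse module. This is only a sketch: the describe function and the DEFAULT_PORTS table are names I made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Default ports, used when the URL does not spell one out.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe(url):
    """Break a URL into host name, port, file name and parameters."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "host": parts.hostname,
        "port": port,
        "file_name": parts.path.lstrip("/"),
        "parameters": parse_qs(parts.query),
    }


info = describe("http://www.a.com/teacher/pic/boduoyejieyi?start=1&end=10")
print(info["host"])        # www.a.com
print(info["port"])        # 80, because no port was given
print(info["file_name"])   # teacher/pic/boduoyejieyi
print(info["parameters"])  # {'start': ['1'], 'end': ['10']}
```

Note that parse_qs returns every value as a list, because the same key may appear more than once in a query string.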

Okay, so what happens when you type baidu.com into your browser?

  1. You type baidu.com in the address bar of Chrome



2. Chrome uses DNS to look up the IP address of baidu.com:




The DNS lookup process looks like this:

Chrome first searches its own cache for a DNS record. If the browser cache does not contain the record, Chrome makes a system call to get it from the operating system's cache.

If the system cache has no record either, the query goes on to the router, which has its own DNS cache.

If the router has no record, the ISP's DNS cache is checked.

If there is still no record, the ISP's DNS server starts a recursive search, beginning at the root server and working down to the DNS server responsible for the domain, and finally obtains the IP address.
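The lookup order above can be sketched as a chain of caches. Everything here is a toy model: the resolve function, the four dictionaries and the fake_recursive_lookup stand-in are all invented for illustration, and real caches also expire their records after a while.

```python
def resolve(domain, caches, recursive_lookup):
    """Return (ip, source), checking each cache in order, nearest first.

    recursive_lookup is a hypothetical stand-in for the ISP DNS server's
    recursive search; it is only called when every cache misses.
    """
    for name, cache in caches:
        if domain in cache:
            return cache[domain], name
    ip = recursive_lookup(domain)
    # Simplification: remember the answer in every cache on the way back.
    for _, cache in caches:
        cache[domain] = ip
    return ip, "recursive lookup"


def fake_recursive_lookup(domain):
    # 192.0.2.10 is a documentation address, not Baidu's real IP.
    return "192.0.2.10"


caches = [
    ("browser cache", {}),
    ("system cache", {}),
    ("router cache", {}),
    ("ISP cache", {}),
]

first = resolve("baidu.com", caches, fake_recursive_lookup)
second = resolve("baidu.com", caches, fake_recursive_lookup)
print(first)   # ('192.0.2.10', 'recursive lookup'): every cache missed
print(second)  # ('192.0.2.10', 'browser cache'): now the first cache hits
```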

3. The browser sends an HTTP request to the Baidu server




Once we have Baidu's IP address, the browser can send an HTTP request to the Baidu server. A request made by entering a URL in the address bar is a GET request, and it carries request headers like these to the Baidu server:

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip, deflate, br
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:no-cache
Connection:keep-alive
Cookie:PSTM=1506157985; BIDUPSID=DA662DF344C147D17FB4828CCD795292; ...
Host:www.baidu.com
Pragma:no-cache
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36


If you have written crawlers that need to get around anti-crawling measures, these headers will look familiar:

  1. User-Agent tells the Baidu server about the client: browser type and version, operating system and rendering engine.
  2. Accept tells the server which content types we are willing to receive.
  3. Connection:keep-alive asks the server not to close the TCP connection, so that subsequent requests can reuse it.
  4. Cookie carries the cookies, which are stored as text and sent to the server with every request. They can hold the user's state, user name and other information.
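A crawler usually sets these headers itself so that its request looks like the browser's. Here is a minimal sketch with Python's standard urllib.request. The header values are taken from the request shown above, with Accept shortened and the cookie left out because it belongs to one browser session.

```python
from urllib.request import Request

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
}

# Building the Request object does not send anything over the network.
request = Request("http://www.baidu.com/", headers=headers)

print(request.get_method())  # GET, because no request body is attached
for name, value in request.header_items():
    print(name, "=", value)
```

Passing the object to urllib.request.urlopen(request) would actually send it.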


4. The Baidu server responds with a 301 redirect




Since we typed baidu.com in Chrome instead of www.baidu.com, the Baidu server responds with a 301 permanent redirect to www.baidu.com.

5. Chrome requests the redirect address


At this point Chrome knows that www.baidu.com is the address Baidu wants it to visit, so Chrome sends another request, this time for www.baidu.com.
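Steps 4 and 5 amount to a small loop: request, look at the status code, and request again if the server points somewhere else. The sketch below replays that loop against canned responses. The follow_redirects function, the fetch callable and the RESPONSES table are hypothetical helpers of mine, not real Baidu traffic.

```python
def follow_redirects(url, fetch, max_hops=5):
    """Request url, following 301/302 responses via their Location header.

    fetch is a hypothetical callable returning (status, headers) for a URL;
    it stands in for the real network round trip.
    """
    visited = [url]
    for _ in range(max_hops):
        status, headers = fetch(url)
        if status not in (301, 302):
            return status, visited
        url = headers["Location"]
        visited.append(url)
    raise RuntimeError("too many redirects")


# Canned responses that mimic steps 4 and 5.
RESPONSES = {
    "http://baidu.com/": (301, {"Location": "http://www.baidu.com/"}),
    "http://www.baidu.com/": (200, {"Content-Type": "text/html; charset=utf-8"}),
}


def fake_fetch(url):
    return RESPONSES[url]


status, visited = follow_redirects("http://baidu.com/", fake_fetch)
print(status)   # 200
print(visited)  # ['http://baidu.com/', 'http://www.baidu.com/']
```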

6. Baidu server processes the request

When the Baidu server receives this request, it checks the request's parameters and cookie information and then does its work, such as storing data or fetching the requested data from a database.
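We cannot see how Baidu's servers do this, but the first part, reading the parameters and cookies out of a request, can be sketched with Python's standard library. The inspect_request function and the "/s?wd=url" path are made up for illustration; the cookie values are the ones from the request shown earlier.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit, parse_qs


def inspect_request(path, cookie_header):
    """Extract query parameters and cookies, the way a server might."""
    params = parse_qs(urlsplit(path).query)
    cookies = {
        name: morsel.value
        for name, morsel in SimpleCookie(cookie_header).items()
    }
    return params, cookies


params, cookies = inspect_request(
    "/s?wd=url",  # hypothetical path and parameter
    "PSTM=1506157985; BIDUPSID=DA662DF344C147D17FB4828CCD795292",
)
print(params)   # {'wd': ['url']}
print(cookies)
```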

7. The Baidu server returns an HTML response


After processing the request, the Baidu server returns data to the browser. The response begins with response headers like these:

Bdpagetype:1
Bdqid:0xddf2be49000b5995
Bduserid:0
Cache-Control:private
Connection:Keep-Alive
Content-Encoding:gzip
Content-Type:text/html; charset=utf-8
Cxy_all:baidu+09720a4fa84e5493ae7506a57de6bc05
Date:Sat, 14 Oct 2017 09:39:32 GMT
Expires:Sat, 14 Oct 2017 09:39:32 GMT
Server:BWS/1.1
Set-Cookie:BDSVRTM=49; path=/
Set-Cookie:BD_HOME=0; path=/
Set-Cookie:H_PS_PSSID=1440_13551_21103_24658; path=/; domain=.baidu.com
Strict-Transport-Security:max-age=172800
Transfer-Encoding:chunked
Vary:Accept-Encoding
X-Powered-By:HPHP
X-Ua-Compatible:IE=Edge,chrome=1


The response headers tell the browser whether the page may be cached, how to interpret the response body, which cookies to set, privacy information, and so on.

Among them:

  • Content-Encoding:gzip tells the browser that the response body is compressed with the gzip algorithm.
  • Content-Type:text/html; charset=utf-8 tells the browser to render the response as HTML, using the UTF-8 character set.
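Here is roughly what a client does with those two headers, sketched with Python's standard library. The decode_body function is a hypothetical helper, the sample body is a tiny stand-in for the real page, and falling back to UTF-8 when no charset is declared is my own assumption.

```python
import gzip
from email.message import Message


def decode_body(raw_body, headers):
    """Undo gzip compression, then decode the bytes with the declared charset."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        raw_body = gzip.decompress(raw_body)
    # Message knows how to parse "text/html; charset=utf-8" style values.
    msg = Message()
    msg["Content-Type"] = headers.get("Content-Type", "text/html")
    charset = msg.get_content_charset() or "utf-8"  # fallback is an assumption
    return raw_body.decode(charset)


headers = {"Content-Encoding": "gzip", "Content-Type": "text/html; charset=utf-8"}
original = "<html><title>百度一下</title></html>"
body = gzip.compress(original.encode("utf-8"))

page = decode_body(body, headers)
print(page == original)  # True: the text survives the round trip
```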


8. Chrome displays the Baidu page

Chrome now has the response body and starts to display Baidu's HTML page. While rendering, it finds tags that reference other resources, such as images, CSS style sheets and JavaScript files, so it sends further requests for these static files. Baidu caches such files and distributes them through a content delivery network (CDN): copies are kept in many CDN data centers, so the browser can fetch them quickly.
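To get a feel for how the browser discovers those extra files, here is a sketch that scans a page for them with Python's standard html.parser. The ResourceFinder class is my own, and the page with its cdn.example.com addresses is a made-up placeholder. A real browser does far more than this.

```python
from html.parser import HTMLParser


class ResourceFinder(HTMLParser):
    """Collect the URLs of images, style sheets and scripts in a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])


page = """
<html><head>
<link rel="stylesheet" href="http://cdn.example.com/style.css">
<script src="http://cdn.example.com/app.js"></script>
</head><body><img src="http://cdn.example.com/logo.png"></body></html>
"""

finder = ResourceFinder()
finder.feed(page)
for url in finder.resources:
    print(url)  # each of these needs its own follow-up request
```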

With that, the page is fully displayed.


Of course, this is only a brief introduction, meant to give you a clear overall picture. If you are interested in networking, you can go on to study how computer networks are put together: what a message is, what packet switching is, how the handshake works, and how the layers relate to one another, such as the physical layer, the data link layer and the transport layer.