A Toutiao interviewer patiently asks: from entering a URL to the page being displayed, how much do you actually know?


When you see this question, doesn't it bring tears to your eyes? It is a topic we all love dearly. I had memorized the answer for ages and was finally asked it, but reality is always cruel. The scene went like this:

Interviewer: Can you talk in detail about what happens from the moment you type in a URL to the moment the page is displayed?

Me: I have a deep understanding of this topic.

Interviewer: I know you can't tell me everything, so just tell me what you know.

Me: How do you know I can't tell you everything?

Me: First, the browser parses the entered URL.

Then the grilling began:

Why do URLs need to be encoded? And by the way, what is the difference between a URI and a URL?

Here’s a simple breakdown of the problem:

URI and URL

First, some definitions:

URL: Uniform Resource Locator (RFC 1738)

URN: Uniform Resource Name (RFC 2141)

URI: Uniform Resource Identifier (RFC 1630)

Honestly, the official names alone do not tell me what each one actually does, so I looked up the official explanations:

URL: Commonly known as a web address, it is the address of a resource on the Internet. It is meant to provide a way to locate the resource.

URN: Meant to provide a persistent, location-independent identifier for a resource, and to allow multiple namespaces to be easily mapped into a single URN namespace.

URI: Used to distinguish resources. It is a superset of URL and URN and is intended to replace both concepts.

Can you go into more detail on URIs and URLs?

I'll explain the URI first, breaking the name down part by part.

Resource

A resource can be an image, a document, or a piece of information. It can also be an entity that cannot be accessed over the network, or even an abstract concept.
A resource can have multiple URIs.

Identifier

An identifier is a name used to distinguish the current resource from other resources.

Uniform

Allows different kinds of resources to appear in the same context.
Allows different kinds of resource identifiers to be interpreted using the same semantics.
Allows new identifiers to be introduced without affecting existing ones.
Allows the same resource identifier to be reused in different contexts at Internet scale.

OK, now tell me about the components of a URI.

Components of a URI

scheme: the name of the scheme or protocol, which indicates how the resource should be accessed

://: in URLs such as HTTP ones, the scheme is followed by the three characters ://

user:passwd@: the identity information (user name and password)

host:port: the host name and port number

path: the location of the resource on the host

query: the query parameters of the URI

fragment: the fragment identifier
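As a rough illustration of the list above, the sketch below splits an address into those components with the standard `URL` class available in browsers and Node.js. The helper name `splitUri` and the sample address are made up for this example.

```javascript
// Sketch: split a URI into the components listed above.
// splitUri and the sample address are hypothetical, for illustration only.
function splitUri(input) {
  const url = new URL(input);
  return {
    scheme: url.protocol.replace(/:$/, ""), // "https:" -> "https"
    user: url.username,
    passwd: url.password,
    host: url.hostname,
    port: url.port,
    path: url.pathname,
    query: url.search.replace(/^\?/, ""),   // drop the leading "?"
    fragment: url.hash.replace(/^#/, ""),   // drop the leading "#"
  };
}

const parts = splitUri(
  "https://user:passwd@www.example.com:8080/docs/index.html?key=value#section"
);
console.log(parts.scheme);   // https
console.log(parts.host);     // www.example.com
console.log(parts.port);     // 8080
console.log(parts.path);     // /docs/index.html
console.log(parts.query);    // key=value
console.log(parts.fragment); // section
```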

Hmm, that's kind of interesting, though I may not know it in that much detail.

Can you tell us what you've learned?

scheme: The most common is http, which means the HTTP protocol is used, followed by https, which means the encrypted, secure HTTPS protocol. There are also less common schemes such as ftp, news, and file. When the browser sees the scheme in a URI, it handles the URI according to the corresponding protocol. Without a scheme, the browser has no way of knowing how to handle the URI.

://: These three characters separate the scheme from the rest of the URI. The design is probably a historical leftover that we simply have to accept.

user:passwd@: The original intent was to send the user name and password when logging in to the host. This form is no longer recommended, because the credentials are exposed in plain text, which is a serious security risk.

host:port: The host name can be an IP address or a domain name, and it must be present; otherwise the browser cannot find the server. The port number can be omitted, in which case browsers and other clients use the default port for the scheme: 80 for HTTP and 443 for HTTPS.
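The default-port rule can be sketched as follows. The `DEFAULT_PORTS` table and the `effectivePort` helper are hypothetical names used only for this example; the sketch relies on the standard `URL` class reporting an empty `port` when the port is omitted or equals the scheme's default.

```javascript
// Default ports named in the text: 80 for HTTP, 443 for HTTPS.
// DEFAULT_PORTS and effectivePort are hypothetical helpers for illustration.
const DEFAULT_PORTS = { "http:": "80", "https:": "443" };

// Returns the port a client would connect to for the given address.
function effectivePort(input) {
  const url = new URL(input);
  // url.port is "" when the port is omitted or is the scheme's default.
  return url.port !== "" ? url.port : (DEFAULT_PORTS[url.protocol] ?? "");
}

console.log(effectivePort("http://www.example.com/"));       // 80
console.log(effectivePort("https://www.example.com/"));      // 443
console.log(effectivePort("https://www.example.com:8443/")); // 8443
```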

path: This resembles a file system directory path. In the early days of the Internet most machines ran UNIX, so the UNIX style with / as the separator was adopted.

query: Starts with a ? but does not include the ? itself, and expresses additional requirements on the resource. The question mark is a vivid metaphor that clearly conveys the act of querying. The query has its own format: multiple key=value strings joined with &. This lets browsers and servers parse the query parameters into dictionaries or arrays.
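A minimal sketch of that parsing step, assuming the standard `URLSearchParams` class does the splitting and decoding. The helper name `parseQuery` and the sample query string are invented for this example; repeated keys are collected into an array.

```javascript
// Sketch: turn a key=value&key=value query string into a dictionary.
// parseQuery is a hypothetical helper; repeated keys become arrays.
function parseQuery(query) {
  const result = {};
  for (const [key, value] of new URLSearchParams(query)) {
    if (Object.prototype.hasOwnProperty.call(result, key)) {
      // Key seen before: append to (or start) an array of values.
      result[key] = [].concat(result[key], value);
    } else {
      result[key] = value;
    }
  }
  return result;
}

const parsed = parseQuery("wd=uri&page=2&tag=a&tag=b");
console.log(parsed.wd);   // uri
console.log(parsed.page); // 2
console.log(parsed.tag);  // [ 'a', 'b' ]
```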

fragment: An anchor or marker used to locate a position inside the resource that the URI identifies. After retrieving the resource, the browser can jump to the position the fragment points to. Personally, I think it's a nice design, but fragments are only used by the browser and are not sent to the server.

So that's it? You haven't gotten back to the other question yet. Let's answer that one now.

Why encode URLs?

First of all, we need to know that only ASCII characters can be used in a URI. Allowing anything else would cause even more problems, but we won't go into that here.

What if the data being transferred contains reserved characters that are used as delimiters?

For example 🌰:

https://www.baidu.com/s?wd=?#! cannot be parsed properly

? is the query delimiter; # is the fragment delimiter

https://www.baidu.com/s?wd=掘金 can be parsed normally

https://www.baidu.com/s?wd=掘'>金 cannot be parsed properly

Potentially ambiguous data must be encoded:
- Characters outside the ASCII range
- ASCII characters that cannot be displayed
- Reserved characters specified by the URI syntax
- Unsafe characters, such as spaces, quotes, and angle brackets
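To see why reserved characters in data need encoding, the sketch below feeds the `?#!` search term from the example above through the standard `URL` class, once raw and once passed through `encodeURIComponent`. The variable names are chosen for this example only.

```javascript
const raw = "?#!"; // the search term from the example above

// Unencoded: "#" starts the fragment, so part of the value never reaches the query.
const broken = new URL("https://www.baidu.com/s?wd=" + raw);
console.log(broken.search); // ?wd=?
console.log(broken.hash);   // #!

// Encoded: the reserved characters travel as data inside the query.
const safe = new URL("https://www.baidu.com/s?wd=" + encodeURIComponent(raw));
console.log(safe.search);                 // ?wd=%3F%23!
console.log(safe.searchParams.get("wd")); // ?#!
```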

URI percent-encoding

pct-encoded = "%" HEXDIG HEXDIG
For the hexadecimal letters in HEXDIG, uppercase and lowercase are equivalent.

The URI escaping rule is actually rather crude: non-ASCII and special characters are converted directly to their hexadecimal byte values, each preceded by a %.

Non-ASCII characters (such as Chinese characters): it is recommended to encode them as UTF-8 first and then percent-encode the resulting bytes as US-ASCII characters.
For characters that are legal in a URI, the encoded and unencoded forms are equivalent:
- www.baidu.com/s?wd=URI%20…

- www.baidu.com/s?wd=%55%52…
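The rule can be written out by hand in a few lines. This is only a sketch: the `percentEncode` helper is hypothetical, it leaves just letters, digits, and `-._~` untouched (which is stricter than the built-in functions), and the sample strings are chosen for illustration.

```javascript
// Sketch of percent-encoding: take the UTF-8 bytes of the text and write
// every byte that is not a letter, digit, or one of -._~ as "%" + two hex digits.
// percentEncode is a hypothetical helper, not a built-in.
function percentEncode(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  let out = "";
  for (const byte of bytes) {
    const ch = String.fromCharCode(byte);
    out += /[A-Za-z0-9\-._~]/.test(ch)
      ? ch
      : "%" + byte.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

console.log(percentEncode("金"));  // %E9%87%91
console.log(percentEncode("a b")); // a%20b

// Hex digits are case-insensitive when decoding.
console.log(decodeURIComponent("%e9%87%91")); // 金
// A legal character and its encoded form decode to the same text.
console.log(decodeURIComponent("%55%52%49")); // URI
```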

Here’s a breakdown of some of the more difficult parts:

ASCII:

128 characters (95 printable characters and 33 non-printable control characters)
Reference address: zh.wikipedia.org/wiki/ASCII#…

Reserved characters:

reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

The gen-delims character set represents the delimiters between URI components. Because a component may itself have subcomponents, sub-delims is also needed to define the delimiters between those subcomponents.

Unreserved characters:

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
ALPHA = a-z / A-Z
DIGIT = 0-9

You may also want to read up on ABNF, the notation these rules are written in.
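As a rough sketch, the reserved and unreserved sets above can be turned into a small classifier. The constant names and the `classify` function are made up for this example, and the "other" label is this example's own shorthand for characters that belong to neither set.

```javascript
// Character sets copied from the grammar above.
const GEN_DELIMS = ":/?#[]@";
const SUB_DELIMS = "!$&'()*+,;=";

// classify is a hypothetical helper: it reports which set a character belongs to.
function classify(ch) {
  if (ch.length !== 1) throw new RangeError("expected a single character");
  if (GEN_DELIMS.includes(ch)) return "gen-delim";
  if (SUB_DELIMS.includes(ch)) return "sub-delim";
  if (/^[A-Za-z0-9\-._~]$/.test(ch)) return "unreserved";
  return "other"; // in neither set, e.g. a space or a non-ASCII character
}

console.log(classify("?")); // gen-delim
console.log(classify("&")); // sub-delim
console.log(classify("~")); // unreserved
console.log(classify(" ")); // other
```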

How do I encode URIs?

I learned that historically there have been three pairs of encoding and decoding functions:

escape / unescape
encodeURI / decodeURI
encodeURIComponent / decodeURIComponent

escape is deprecated and kept only for backward compatibility, so we won't cover it.

For encodeURI(), I'll focus on the special cases:

encodeURI replaces every character with its UTF-8 escape sequence except the following, which are left as-is: A-Z a-z 0-9 ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) #

On its own, encodeURI cannot produce a URI that is safe to use in HTTP GET or POST requests, for example with XMLHttpRequest, because "&", "+", and "=" are not encoded even though they are special characters in GET and POST requests. encodeURIComponent, however, does encode them.
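The difference shows up as soon as a query value contains one of those characters. In the sketch below the value `a&b=c+d` and the `www.example.com` address are invented for illustration; the standard `URL` class is used only to show how a receiver would read the result.

```javascript
const value = "a&b=c+d"; // hypothetical user input

// encodeURI leaves "&", "=" and "+" alone, so the value leaks into the query structure.
const wrong = "https://www.example.com/s?q=" + encodeURI(value);
// encodeURIComponent escapes them, so the value stays a single parameter.
const right = "https://www.example.com/s?q=" + encodeURIComponent(value);

console.log(wrong); // https://www.example.com/s?q=a&b=c+d
console.log(right); // https://www.example.com/s?q=a%26b%3Dc%2Bd

console.log(new URL(wrong).searchParams.get("q")); // a
console.log(new URL(wrong).searchParams.get("b")); // c d
console.log(new URL(right).searchParams.get("q")); // a&b=c+d
```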

Special cases of encodeURIComponent:

encodeURIComponent escapes all characters except those shown below.

Unescaped characters: A-Z a-z 0-9 - _ . ! ~ * ' ( )

For application/x-www-form-urlencoded (POST) data, spaces need to be replaced with "+", so "%20" is often replaced with "+" after calling encodeURIComponent.
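A minimal sketch of that replacement, assuming a flat object of string fields. The `formEncode` helper and the field names are hypothetical; the built-in `URLSearchParams` serializer is shown next to it for comparison on this particular input (the two do not agree on every character).

```javascript
// Encode one key or value, then swap "%20" for "+" as form encoding expects.
function formEncodePart(text) {
  return encodeURIComponent(text).replace(/%20/g, "+");
}

// formEncode is a hypothetical helper that builds a form-urlencoded body.
function formEncode(fields) {
  return Object.entries(fields)
    .map(([key, value]) => formEncodePart(key) + "=" + formEncodePart(value))
    .join("&");
}

const fields = { q: "hello world", lang: "zh CN" }; // sample data
const body = formEncode(fields);
console.log(body); // q=hello+world&lang=zh+CN

// The built-in serializer produces the same string for this input.
console.log(new URLSearchParams(fields).toString()); // q=hello+world&lang=zh+CN
```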

You can run the following code to see the difference between the two for yourself:

const set1 = ";,/?:@&=+$";  // reserved characters
const set2 = "-_.!~*'()";   // unescaped characters
const set3 = "#";           // number sign
const set4 = "ABC abc 123"; // alphanumeric characters and spaces

console.log(encodeURI(set1)); // ;,/?:@&=+$
console.log(encodeURI(set2)); // -_.!~*'()
console.log(encodeURI(set3)); // #
console.log(encodeURI(set4)); // ABC%20abc%20123 (the space gets encoded as %20)

console.log(encodeURIComponent(set1)); // %3B%2C%2F%3F%3A%40%26%3D%2B%24
console.log(encodeURIComponent(set2)); // -_.!~*'()
console.log(encodeURIComponent(set3)); // %23
console.log(encodeURIComponent(set4)); // ABC%20abc%20123 (the space gets encoded as %20)

When I first ran into this question, I felt a little like giving up. Ha ha, maybe that is the charm of knowledge. More articles will follow, and feel free to get in touch if you have any questions.

Finally, I'll leave a couple of questions that I hope we can find the answers to together:

Do you know about relative URIs?
How many forms of URI have you come across?
