When you see this question, do your eyes fill with tears because you love the topic so deeply? I had it memorized for a long time and was finally asked about it, but reality is always cruel. The scene went like this:
Interviewer: Can you talk specifically about what happens from the time you type in the URL to the time the page is displayed?
Me: I have a deep understanding of this area.
Interviewer: I know you can't tell me everything, but just tell me what you know.
Me: How do you know I can't tell you everything?
Me: First, the browser will parse the entered URL.
Then the soul-searching questions began:
Why do you need to encode URLs? And by the way, what about URIs and URLs?
Here’s a simple breakdown of the problem:
URI and URL
First of all, some definitions:
URL: RFC1738 Uniform Resource Locator
URN: RFC2141 Uniform Resource Name
URI: RFC1630 Uniform Resource Identifier
Honestly, official language like this did not tell me what each one actually does, so I looked up the official explanations:
URL: Commonly known as a web address, it is the address of a standard resource on the Internet. It is expected to provide a way to locate the resource.
URN: Expected to provide a persistent, location-independent identifier for a resource, and to allow multiple namespaces to be easily mapped to a single URN namespace.
URI: Used to distinguish resources. It is a superset of URL and URN, intended to replace the two concepts.
URI or URL, can you tell them apart now?
Here I'll explain URI first, going through each part of the name step by step.
Resource
- A resource can be an image, a document, or a piece of information. It can also be an entity that cannot be accessed through the network, or an abstract concept.
- A resource can have multiple URIs
Identifier
- A name used to distinguish the current resource from other resources
Uniform
- Allows different kinds of resources to appear in the same context
- Different kinds of resource identifiers can be interpreted using the same semantics
- The introduction of new identifiers does not affect existing identifiers
- Allows the same resource identifier to be reused in different contexts at Internet scale
OK, now talk about the components of a URI.
Components of a URI
scheme: the name of the scheme or protocol, which indicates how the resource should be accessed
://: the three specific characters that must follow the scheme
user:passwd@: the identity information
host:port: the host name and port number
path: the location of the resource
query: the query parameters of the URI
fragment: the fragment identifier
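As a rough illustration of the components listed above, the sketch below splits a URI with the standard URL class found in browsers and Node.js. The splitUri helper name and the example address are made up for this illustration and do not come from the original article.

```javascript
// Hypothetical helper: splits a URI into the components listed above
// using the standard URL class (available in browsers and Node.js).
function splitUri(uri) {
  const u = new URL(uri);
  return {
    scheme: u.protocol.replace(/:$/, ""), // URL.protocol keeps the trailing ":"
    user: u.username,
    passwd: u.password,
    host: u.hostname,
    port: u.port,                         // empty string when omitted or default
    path: u.pathname,
    query: u.search.replace(/^\?/, ""),   // URL.search keeps the leading "?"
    fragment: u.hash.replace(/^#/, ""),   // URL.hash keeps the leading "#"
  };
}

// Made-up example address, used only to show each component.
const parts = splitUri(
  "https://user:passwd@example.com:8080/docs/index.html?key=value&lang=en#intro"
);
console.log(parts);
```

Note that the URL class throws a TypeError for input without a scheme, so real code would probably wrap the call in try/catch.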
That's kind of interesting, though I probably don't know every detail.
Can you tell us what you've learned?
scheme: The most common is http, which indicates that the HTTP protocol is used, and https, which indicates the encrypted and secure HTTPS protocol. There are also some less common schemes such as ftp, news, and file. When the browser sees the scheme in a URI, it handles the URI according to the corresponding protocol. Without a scheme, the browser does not know how to handle the URI and does nothing with it.
://: These three characters separate the scheme from the rest of the URI. The design is probably a historical accident, and we can only accept it.
user:passwd@: The original design was to send a user name and password when logging in to the host. This is no longer recommended, because the credentials are exposed in plain text, which is a serious security risk.
host:port: The host name can be an IP address or a domain name, and it must be present; otherwise the browser cannot find the server. The port number can be omitted, in which case browsers and other clients use the default port for the scheme: 80 for HTTP and 443 for HTTPS.
path: This is similar to a file system directory path. In the early days of the Internet most servers ran UNIX, so the UNIX-style / separator was adopted.
query: Starts with a ? but does not include it, and expresses additional requirements on the resource. The question mark is a vivid metaphor for the act of querying. The query has its own format: multiple key=value strings joined with &. This lets browsers and servers parse the query parameters into dictionaries or arrays.
fragment: An anchor or tag that locates a position inside the resource identified by the URI. After retrieving the resource, the browser can jump to the position it indicates. Personally, I think it's a nice design, but the fragment is used only by the browser and is not sent to the server.
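Three of the points above (default ports, the key=value&key=value query format, and the fragment staying in the browser) can be sketched in a few lines. The helper names, the DEFAULT_PORTS table, and the example addresses are assumptions made for this illustration.

```javascript
// Default ports mentioned in the text; the table itself is an illustrative assumption.
const DEFAULT_PORTS = { http: 80, https: 443 };

// Hypothetical helper: the port a client would connect to.
function effectivePort(uri) {
  const u = new URL(uri);
  const scheme = u.protocol.replace(/:$/, "");
  // URL.port is "" when the port is omitted or equals the scheme default.
  return u.port !== "" ? Number(u.port) : DEFAULT_PORTS[scheme];
}

// Hypothetical helper: parses key=value pairs joined by & into a plain object.
// If a key repeats, Object.fromEntries keeps only the last value.
function parseQuery(uri) {
  return Object.fromEntries(new URL(uri).searchParams);
}

// Hypothetical helper: the part of the URI an HTTP client sends as the
// request target. The fragment is left out because it stays in the browser.
function requestTarget(uri) {
  const u = new URL(uri);
  return u.pathname + u.search;
}

console.log(effectivePort("http://example.com/"));
console.log(parseQuery("https://example.com/s?wd=uri&page=2#top"));
console.log(requestTarget("https://example.com/s?wd=uri&page=2#top"));
```

Returning a plain object keeps the sketch short; code that needs repeated keys would more likely work with searchParams.getAll directly.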
So that's that. But you still haven't come back to the other question, have you? Let's answer it now.
Why encode URLs?
First of all, we need to know that only ASCII characters can be used in a URI. Allowing other characters would cause more problems, but we won't go into that here.
- What if there are reserved characters that are used as separators during data transfer?
For example:
https://www.baidu.com/s?wd=?#! (cannot be parsed properly)
? is the query separator; # is the fragment delimiter
https://www.baidu.com/s?wd=Digging for gold (can be parsed normally)
https://www.baidu.com/s?wd=dig'>gold (cannot be parsed properly)
Encode potentially ambiguous data:
- Characters outside the ASCII range
- ASCII characters that cannot be displayed
- Reserved characters specified in the URI
- Unsafe characters, such as spaces, quotes, and angle brackets
URI percent-encoding:
- pct-encoded = "%" HEXDIG HEXDIG
- For the hexadecimal letters in HEXDIG, uppercase and lowercase are equivalent
In fact, the URI escape rules are a bit crude: non-ASCII and special characters are converted directly to their hexadecimal byte values, each preceded by a %.
- Non-ASCII characters (such as Chinese characters): it is recommended to encode them as UTF-8 bytes first, and then percent-encode each byte as US-ASCII characters
- For characters that are legal in a URI, the encoded and unencoded forms are equivalent
- www.baidu.com/s?wd=URI%20…
- www.baidu.com/s?wd=%55%52…
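The rule described above can be written out by hand as a minimal sketch: take the UTF-8 bytes of the text and turn every byte that is not an unreserved character into % plus two hex digits. The percentEncode helper is hypothetical and is stricter than the built-in encodeURIComponent, which leaves a few more characters untouched.

```javascript
// Unreserved characters (letters, digits, "-", ".", "_", "~") pass through unchanged.
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;

// Hypothetical helper: percent-encodes every byte that is not unreserved.
function percentEncode(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the input
  let out = "";
  for (const byte of bytes) {
    const ch = String.fromCharCode(byte);
    if (byte < 0x80 && UNRESERVED.test(ch)) {
      out += ch;
    } else {
      // "%" followed by exactly two hexadecimal digits
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncode("掘金"));            // %E6%8E%98%E9%87%91
console.log(percentEncode("?#!"));             // %3F%23%21
console.log(decodeURIComponent("%55%52%49"));  // URI
console.log(decodeURIComponent("%e6%8e%98"));  // 掘 (lowercase hex digits decode the same way)
```

The last two lines show the equivalences from the list: an encoded legal character decodes back to itself, and the case of the hex digits does not matter.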
Here’s a breakdown of some of the more difficult parts:
ASCII:
- 128 characters (95 displayable and 33 non-displayable)
- Reference: zh.wikipedia.org/wiki/ASCII#…
Reserved characters:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
The gen-delims character set represents the delimiters between URI components. Because a component may itself contain subcomponents, sub-delims are needed to delimit those subcomponents.
Unreserved characters:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
ALPHA = a-z / A-Z
DIGIT = 0-9
You may also want to learn ABNF, the notation used in these definitions.
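To make the character classes above concrete, here is a small sketch that sorts a single character into unreserved, gen-delims, sub-delims, or other. The classifyChar name and the "other" bucket are assumptions made for this example, not terms from the URI grammar.

```javascript
// Character sets copied from the grammar above.
const GEN_DELIMS = new Set(":/?#[]@");
const SUB_DELIMS = new Set("!$&'()*+,;=");

// Hypothetical helper: reports which class a single character belongs to.
function classifyChar(ch) {
  if (/^[A-Za-z0-9\-._~]$/.test(ch)) return "unreserved";
  if (GEN_DELIMS.has(ch)) return "gen-delims";
  if (SUB_DELIMS.has(ch)) return "sub-delims";
  return "other"; // e.g. spaces, quotes, non-ASCII: must be percent-encoded
}

for (const ch of ["a", "~", "?", "&", " ", "%"]) {
  console.log(JSON.stringify(ch), classifyChar(ch));
}
```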
How do I encode URIs?
I learned that historically there have been three pairs of encoding and decoding functions:
- escape / unescape
- encodeURI / decodeURI
- encodeURIComponent / decodeURIComponent
escape is deprecated, so we won't cover it.
For encodeURI(), I'll focus on the special cases:
- encodeURI replaces all characters except the following: A-Z a-z 0-9 ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) #
- encodeURI by itself cannot produce a URI suitable for HTTP GET or POST requests, for example with XMLHttpRequest, because "&", "+", and "=" are not encoded even though they are special characters in GET and POST requests. encodeURIComponent, however, does encode them.
Special cases of encodeURIComponent:
- encodeURIComponent escapes all characters except those shown below:
Unescaped characters: A-Z a-z 0-9 - _ . ! ~ * ' ( )
- For application/x-www-form-urlencoded (POST) data, spaces need to be replaced with "+", so "%20" is often replaced with "+" after calling encodeURIComponent.
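The two special cases above can be sketched together: the difference in how "&" and "=" are treated, and the "%20" to "+" replacement for form data. The encodeFormComponent and buildFormBody helpers and the sample field values are hypothetical.

```javascript
// Hypothetical helper: encodes one key or value for
// application/x-www-form-urlencoded data ("%20" becomes "+").
function encodeFormComponent(value) {
  return encodeURIComponent(value).replace(/%20/g, "+");
}

// Hypothetical helper: builds a form body from a plain object.
function buildFormBody(fields) {
  return Object.entries(fields)
    .map(([key, value]) =>
      `${encodeFormComponent(key)}=${encodeFormComponent(String(value))}`)
    .join("&");
}

// "&" and "=" survive encodeURI, so a value containing them would be
// misread as extra parameters; encodeURIComponent escapes them.
console.log(encodeURI("a&b=c d"));           // a&b=c%20d
console.log(encodeURIComponent("a&b=c d"));  // a%26b%3Dc%20d
console.log(buildFormBody({ q: "a&b=c d", lang: "en" })); // q=a%26b%3Dc+d&lang=en
```

Encoding each key and value separately, then joining with "&", is what keeps the separators unambiguous.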
You can refer to the following code to actually experience the difference between the two:
var set1 = ";,/?:@&=+$";  // Reserved characters
var set2 = "-_.!~*'()";   // Unreserved marks
var set3 = "#";           // Number sign
var set4 = "ABC abc 123"; // Alphanumeric characters and spaces

console.log(encodeURI(set1)); // ;,/?:@&=+$
console.log(encodeURI(set2)); // -_.!~*'()
console.log(encodeURI(set3)); // #
console.log(encodeURI(set4)); // ABC%20abc%20123 (the space gets encoded as %20)

console.log(encodeURIComponent(set1)); // %3B%2C%2F%3F%3A%40%26%3D%2B%24
console.log(encodeURIComponent(set2)); // -_.!~*'()
console.log(encodeURIComponent(set3)); // %23
console.log(encodeURIComponent(set4)); // ABC%20abc%20123 (the space gets encoded as %20)
When I first ran into this question, I felt a little like giving up. Ha ha, maybe that is the charm of knowledge. I will keep updating this series, and we can discuss any questions that come up.
Finally, I will leave a few questions that I hope to answer together with you:
- Do you know relative URIs?
- How many forms of URI have you learned?