Why does a URL turn into a string of gibberish after you open a picture with a Chinese file name? Why does a nice short URL become a very long string once it is copied and pasted? The culprit was…
Hangzhou's plum rain season is finally over! Ergouzi was very happy to see the weather station's announcement. It had been raining in Hangzhou since the end of May, and the grey, rainy skies day after day were almost depressing. Now that the plum rains have ended and the sun is out, Ergouzi's heart drifts along with the white clouds in the blue sky.
His colleague Xiao Fengfeng told Ergouzi that in the Queen's Park behind the company the sunflowers had just opened, a large stretch of them, golden in the sun and lovely to look at. “Want to go and see? You could take some photos for Tingting.”
Ergouzi remembered that his goddess Tingting likes sunflowers. If he took some beautiful photos, he would get to see Tingting's sweet smile, so he couldn't wait and pulled Xiao Fengfeng off to the park.
Ergouzi took a lot of photos, carefully selected a few, and uploaded them to his own picture website. He opened a picture in his browser and checked it: no problem, it was perfect. So he copied the picture address and sent it to Tingting.
Ergouzi was a little embarrassed. The URL was displayed normally in the browser, so how did copying it turn it into a pile of strange characters?
Puzzled, Ergouzi asked his storage service provider, the almighty UPYUN.
Joe from customer service took Ergouzi's question and explained where the problem came from.
URL – Network resource locator
The most common way to access a resource on the Internet is to enter the resource's URL in a browser.
Uniform Resource Locator (URL) is a core concept on the Internet. Simply put, a URL is an address assigned to a resource on the Internet by a website developer. Typically, each valid URL points to a single resource, which can be an HTML page, a CSS document, an image, etc.
A URL is made up of different parts, some of which are required and some optional. Let’s look at the specific components of the URL:
The image above shows the complete URL structure. Often some of these parts are not needed, such as the user information. As a reference, let's take a look at the URL of the cloud storage page.
www.upyun.com/products/fi…
- https://, the request protocol (scheme), specifies which protocol the browser must use to communicate with the target server. Common protocols are HTTP and HTTPS.
- www.upyun.com, the domain name (host), indicates the address of the server that holds the requested resource.
- /products/file-storage, the resource path (path), tells the server where on the server to find the resource.
A typical URL consists of these three parts; the remaining parts can be added as development requirements dictate.
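As a rough illustration, the three parts can be pulled out of that address with Python's standard urllib.parse module. This is only a sketch; the full URL below is assembled from the scheme, host, and path listed above.

```python
from urllib.parse import urlsplit

# Assembled from the three components discussed above.
url = "https://www.upyun.com/products/file-storage"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.netloc)    # www.upyun.com
print(parts.path)      # /products/file-storage
print(parts.query)     # empty string: this URL has no query
print(parts.fragment)  # empty string: this URL has no fragment
```

The optional parts simply come back as empty strings when they are absent.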
With these URL concepts in hand, we can see where Ergouzi's picture link fileupload-upyun.test.upcdn.net/images/ sunflower 1… comes from. Through this address, Tingting can access the sunflower picture stored on Ergouzi's server. But why, when Ergouzi copied the address from the browser's address bar and sent it to Tingting, did the URL turn into fileupload-upyun.test.upcdn.net/images/%E5%… ?
Strange characters – URL encoding
We can see that in the link Ergouzi sent to Tingting, part of the path component of the URL has changed. The English part is unchanged; only the Chinese part has been converted into the %XX encoded form.
This does not stop the image from opening, and the address is still valid. But why do browsers convert Chinese into this strange form?
Let’s start with an example. If you go to the following URL:
www.baidu.com/s?wd=?#!
This link runs a search on Baidu. The part after /s? is the query, that is, the request parameters we want to submit to the server. wd is the search parameter defined by Baidu, and what follows wd= is the content to search for.
We want to search for ?#!, but if you copy this link and open it in a browser, you will find a problem: Baidu searches only for ?, and the #! has disappeared.
Why is that? If you look closely at the URL structure above, you will see that a URL also has a fragment component, introduced by the # delimiter.
So there is a problem here. We want # to be searched as plain text, but # has a specific meaning in a URL, so the browser's interpretation is ambiguous.
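The split can be reproduced with a short Python sketch. It assumes an https:// prefix on the Baidu link (the address above omits the scheme) and only shows how a standard URL parser cuts the string up; how Baidu itself handles the request is not modelled.

```python
from urllib.parse import urlsplit

# Assumption: the scheme is added here so the parser sees a full URL.
raw = "https://www.baidu.com/s?wd=?#!"

parts = urlsplit(raw)
print(parts.path)      # /s
print(parts.query)     # wd=?
print(parts.fragment)  # !
```

Everything after the first # is treated as the fragment, so the query that reaches the server contains only wd=?.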
This raises a question: how should reserved characters, which act as delimiters or have other special functions, be handled when they appear in the data a URL carries?
In real business scenarios, URLs often contain ambiguous data. To avoid misinterpretation, developers process such data so that the ambiguity disappears. There are many ways to do this; the most common is to encode the ambiguous data with URL percent-encoding.
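A minimal sketch of that idea, again in Python and again assuming an https:// prefix on the Baidu link: the search text ?#! is percent-encoded before it is placed in the query, so the parser no longer mistakes # for the fragment delimiter.

```python
from urllib.parse import parse_qs, quote, urlsplit

search_text = "?#!"

# safe="" asks quote() to encode every character outside the unreserved set.
encoded = quote(search_text, safe="")
url = "https://www.baidu.com/s?wd=" + encoded
print(url)  # https://www.baidu.com/s?wd=%3F%23%21

parts = urlsplit(url)
print(parts.fragment)         # empty string: no fragment any more
print(parse_qs(parts.query))  # {'wd': ['?#!']}
```

Decoding the query on the receiving side recovers the original three characters intact.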
Which characters are ambiguous?
According to RFC 3986 (tools.ietf.org/html/rfc398…), the following need to be encoded:
- Characters outside the ASCII range (URLs are encoded using ASCII)
- Non-printable ASCII characters
- Reserved characters defined for URLs
- Unsafe characters, such as spaces, quotes, and angle brackets, which may be handled incorrectly during transmission
Reserved characters are made up of component delimiters (gen-delims) and sub-component delimiters (sub-delims), which have special meaning in the URL:
reserved = gen-delims / sub-delims
- gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
- sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Unreserved characters, which can be used directly in a URL, are:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
- ALPHA: %41 to %5A and %61 to %7A
- DIGIT: %30 to %39
- "-": %2D, ".": %2E, "_": %5F
- "~": %7E (some service implementations treat this as a reserved character and generally require it to be encoded)
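The two character sets can be checked against a real encoder. The sketch below uses Python's urllib.parse.quote as one example implementation (an assumption for illustration; other libraries may behave differently) and prints how each reserved character is encoded while the unreserved ones pass through unchanged.

```python
import string
from urllib.parse import quote

GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
UNRESERVED = string.ascii_letters + string.digits + "-._~"

# safe="" tells quote() not to exempt any extra character
# (by default it leaves "/" unencoded).
for ch in GEN_DELIMS + SUB_DELIMS:
    print(ch, "->", quote(ch, safe=""))

# Unreserved characters are left as they are.
# Note: "~" is only treated as unreserved by quote() in Python 3.7 and later.
print(quote(UNRESERVED, safe="") == UNRESERVED)
```

Each reserved character comes out as % followed by its two-digit hexadecimal ASCII value, for example %23 for #.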
This explains why strange characters sometimes appear in URLs; much of the time the browser encodes and decodes them for us. In the “sunflower” URL, the Chinese part is outside the ASCII range, so the browser percent-encodes it when accessing the resource.
(The legal characters for a cloud service name are drawn from the unreserved characters; arbitrary characters cannot be used.)
Form of encoding
The most common form of encoding is pct-encoded, which is the default for browsers.
pct-encoded = "%" HEXDIG HEXDIG
Percent-encoding represents a reserved character by writing the character's ASCII value as two hexadecimal digits, preceded by the escape character (%). A non-ASCII character is first converted to its UTF-8 byte sequence, and then each byte is represented as described above.
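The rule can be written out by hand in a few lines of Python. This is a sketch for illustration, not how any particular browser implements it; pct_encode is a hypothetical helper, and the file name "向日葵 1.jpg" (Chinese for "sunflower") is an assumed stand-in, since the real file name is truncated in the link above.

```python
import string
from urllib.parse import quote

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def pct_encode(text: str) -> str:
    """Percent-encode every byte of text that is not an unreserved character."""
    out = []
    for byte in text.encode("utf-8"):      # non-ASCII characters become several bytes
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)                 # unreserved: copy as-is
        else:
            out.append("%{:02X}".format(byte))  # "%" followed by two hex digits
    return "".join(out)

name = "向日葵 1.jpg"  # hypothetical file name
encoded_name = pct_encode(name)
print(encoded_name)

# The hand-written version agrees with the standard library encoder.
print(encoded_name == quote(name, safe=""))
```

Each Chinese character occupies three bytes in UTF-8, so it expands to three %XX groups, which is why the copied URL becomes so much longer.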
(Interested readers can refer to the percent-encoding entry at zh.wikipedia.org/zh-hans/%E7… .) Every programming language also has its own functions for percent-encoding. For example, JavaScript provides several, including encodeURI and encodeURIComponent.
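To keep the examples in one language, the sketch below approximates the two JavaScript functions in Python. The helper names encode_uri_like and encode_uri_component_like are hypothetical, and the safe lists are based on the characters the two JavaScript functions leave unescaped; edge cases may differ from the real functions.

```python
from urllib.parse import quote

def encode_uri_like(text: str) -> str:
    # Roughly like encodeURI: keeps URL delimiters so a whole URL stays usable.
    return quote(text, safe=";/?:@&=+$,#!*'()")

def encode_uri_component_like(text: str) -> str:
    # Roughly like encodeURIComponent: also encodes delimiters such as / ? & = #,
    # which suits a single path segment or query value.
    return quote(text, safe="!*'()")

whole = "/images/a b.jpg?x=1&y=2#top"
print(encode_uri_like(whole))            # /images/a%20b.jpg?x=1&y=2#top

value = "a b&c=d/e"
print(encode_uri_component_like(value))  # a%20b%26c%3Dd%2Fe
```

The first form is meant for a complete URL, the second for one piece of data placed inside a URL.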
Joe added that in day-to-day development, the libraries we use may not apply the same URL-encoding standard. Each library makes its own judgment about the network environment it targets and about how to treat special characters safely. As a result, if a URL contains special characters, access may run into strange problems during development. He therefore recommends using only unreserved characters in the parts of the URL that your program designs, to avoid unnecessary bugs.
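Even within a single standard library, the encoding functions disagree. The Python sketch below is one example of the kind of difference Joe describes; the sample string is made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a b/c~d"  # hypothetical sample containing a space and a slash

print(quote(text))           # a%20b/c~d   ("/" is kept by default)
print(quote(text, safe=""))  # a%20b%2Fc~d
print(quote_plus(text))      # a+b%2Fc~d   (space becomes "+", form style)

# Decoding with the wrong counterpart silently changes the data.
print(unquote("a+b"))        # a+b
print(unquote_plus("a+b"))   # a b
```

If one side encodes with one convention and the other decodes with another, a space or a plus sign can change meaning along the way, which is one reason to stick to unreserved characters in names you control.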
After listening to Joe's explanation, Ergouzi finally understood why these strange characters appear in the URL. Hehe, now he can explain to Tingting why the URL is so long.