Abstract

This article introduces the problems surrounding URI encoding and decoding, explains in detail which characters need to be encoded in a URL and why, and compares and analyzes escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent.

Preliminary knowledge

foo://example.com:8042/over/there?name=ferret#nose
\_/   \______________/\_________/ \_________/ \__/
 |            |            |           |        |
scheme    authority       path       query   fragment

URI stands for Uniform Resource Identifier, and a URL is just one type of URI. The format of a typical URL is shown above. The URL encoding discussed below actually refers to URI encoding.

Why do we need URL encoding

Generally, something needs to be encoded when it is not suitable for transmission as-is, for example because it is too large or contains private data. For URLs, encoding is necessary because some characters in a URL would cause ambiguity.

For example, URL query strings carry key=value pairs separated by &, such as /s?q=abc&ie=utf-8. If a value itself contains = or &, the server receiving the URL will parse it incorrectly, so ambiguous occurrences of & and = must be escaped, that is, encoded.

Also, URLs are encoded in ASCII, not Unicode, which means a URL cannot directly include non-ASCII characters such as Chinese; otherwise those characters may cause problems if the client and the server support different character sets.

The principle of URL encoding is to use safe characters (printable characters with no special purpose or meaning) to represent unsafe characters.

Which characters need to be encoded

According to RFC 3986, a URL may contain, unencoded, only letters (A-Z, a-z), digits (0-9), the unreserved marks - _ . ~, and the reserved characters in their delimiter roles; everything else must be percent-encoded.

The RFC 3986 document makes detailed recommendations on URL encoding and decoding, indicating which characters need to be encoded so as not to change the semantics of the URL, and explaining why.

There is no corresponding printable character in the US-ASCII character set

Only printable characters are allowed in a URL. In US-ASCII, the bytes 0x00-0x1F and 0x7F represent control characters, none of which can appear directly in a URL. Likewise, the bytes 0x80-0xFF (e.g. ISO-8859-1) cannot appear in a URL because they are outside the range defined by US-ASCII.

Reserved characters

A URL is made up of several components: scheme, host, path, and so on. Some characters (: / ? # [ ] @) are used to separate these components; for example, the colon separates the scheme from the host, / separates the host from the path, and ? separates the path from the query parameters. Other characters (! $ & ' ( ) * + , ; =) serve as delimiters within a component; for example, = marks key-value pairs in query parameters and & separates multiple key-value pairs. When ordinary data inside a component contains these special characters, it must be encoded.

RFC3986 specifies the following characters as reserved characters:

: / ? # [ ] @ ! $ & ' ( ) * + , ; =

Unsafe characters

There are also some characters that can cause ambiguity in the parser when placed directly in the Url. These characters are considered unsafe for a number of reasons.

  • Space: While a URL is in transit, or while a user is copying it, or while a text processor is handling it, irrelevant whitespace may be introduced or meaningful whitespace may be lost.
  • Quotes and <>: Quotation marks and angle brackets are often used to delimit URLs in plain text.
  • #: Usually used to indicate a bookmark or anchor.
  • %: The percent sign itself is the escape character for encoding unsafe characters, and therefore must be encoded as well.
  • {} | \ ^ ` ~ []: Some gateways or transport agents tamper with these characters.

 

It is important to note that, for legal characters in a URL, the encoded and unencoded forms are equivalent; but for the characters mentioned above, leaving them unencoded may change the meaning of the URL. Therefore, only ordinary letters and digits, the special characters $-_.+!*'(), and the reserved characters (in their delimiter roles) may appear unencoded in a URL; all other characters must be encoded before they can appear in a URL.

However, for historical reasons, some non-standard encoding implementations still exist. For example, although RFC 3986 states that the tilde ~ does not need URL encoding, many older gateways or transport agents encode it anyway.

How to encode invalid characters in a URL

URL encoding is also known as percent-encoding because its scheme is so simple: a percent sign % followed by two hexadecimal digits (0-9, A-F) represents a single byte. The default character set used for URL encoding is US-ASCII. For example, the letter a corresponds to the byte 0x61 in US-ASCII, so its URL encoding is %61; typing http://g.cn/search?q=%61%62%63 in the address bar is essentially the same as searching for abc on Google. Similarly, the @ symbol corresponds to the byte 0x40, so its URL encoding is %40.
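The byte-by-byte scheme above can be sketched in a few lines of JavaScript. The helper name percentEncodeAscii is hypothetical, introduced here only for illustration:

```javascript
// Hypothetical helper: percent-encode every byte of an ASCII string,
// producing "%" plus two uppercase hex digits per byte.
function percentEncodeAscii(str) {
  return Array.from(str)
    .map(ch => "%" + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log(percentEncodeAscii("abc")); // "%61%62%63"
console.log(percentEncodeAscii("@"));   // "%40"

// Decoding recovers the original text, since all bytes are plain ASCII.
console.log(decodeURIComponent("%61%62%63")); // "abc"
```

Note that real URL encoders leave unreserved characters such as a-z untouched; this sketch encodes every byte to make the byte-to-%XX mapping explicit.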

URL encodings of common characters:

Character:  !    *    "    '    (    )    ;    :    @    &
Encoding:   %21  %2A  %22  %27  %28  %29  %3B  %3A  %40  %26

Character:  =    +    $    ,    /    ?    %    #    [    ]
Encoding:   %3D  %2B  %24  %2C  %2F  %3F  %25  %23  %5B  %5D

For non-ASCII characters, the corresponding bytes must first be produced using a superset of the ASCII character set, and then each byte is percent-encoded. For Unicode characters, the RFC recommends encoding them to bytes with UTF-8 and then percent-encoding each byte. For example, the string "中文" encoded with UTF-8 is the byte sequence 0xE4 0xB8 0xAD 0xE6 0x96 0x87, so its URL encoding is %E4%B8%AD%E6%96%87.
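The two-step process described above (UTF-8 bytes first, then percent-encoding) is exactly what encodeURIComponent does, so it can be verified directly:

```javascript
// encodeURIComponent encodes non-ASCII characters as UTF-8 bytes,
// then percent-encodes each byte.
console.log(encodeURIComponent("中文")); // "%E4%B8%AD%E6%96%87"

// Decoding reverses both steps.
console.log(decodeURIComponent("%E4%B8%AD%E6%96%87")); // "中文"
```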

If a byte corresponds to an unreserved character in the ASCII character set, it does not have to be percent-encoded. For example, the UTF-8 encoding of "Url编码" is 0x55 0x72 0x6C 0xE7 0xBC 0x96 0xE7 0xA0 0x81. Since the first three bytes correspond to the unreserved ASCII characters U, r, and l, they can be left as the literal text "Url", so the whole string can be encoded as "Url%E7%BC%96%E7%A0%81". Of course, "%55%72%6C%E7%BC%96%E7%A0%81" is equally valid.

For historical reasons, there are some URL-encoding implementations that do not fully follow this principle, as discussed below.

escape, encodeURI, and encodeURIComponent in JavaScript

JavaScript provides three pairs of functions to encode a URL into a valid form: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Since decoding simply reverses encoding, only the encoding functions are discussed here.

All three encoding functions, escape, encodeURI, and encodeURIComponent, convert unsafe or invalid URL characters into valid representations, but they differ in several respects.

Different safe characters

The following table lists the safe characters for each function (that is, the characters the function does not encode):

escape (69 characters):             * / @ + - . _ 0-9 a-z A-Z
encodeURI (82 characters):          ! # $ & ' ( ) * + , - . / : ; = ? @ _ ~ 0-9 a-z A-Z
encodeURIComponent (71 characters): ! ' ( ) * - . _ ~ 0-9 a-z A-Z

Different compatibility

The escape function has existed since JavaScript 1.0; the other two were introduced in JavaScript 1.5. Since JavaScript 1.5 is now ubiquitous, there are effectively no compatibility issues with encodeURI and encodeURIComponent.

Unicode characters are encoded differently

All three functions encode ASCII characters in the same way: a percent sign followed by two hexadecimal digits. For Unicode characters, however, escape produces %uXXXX, where XXXX is the four-digit hexadecimal code of the character. This form has been deprecated by the W3C, although the ECMA-262 standard still retains it. encodeURI and encodeURIComponent encode non-ASCII characters as UTF-8 bytes and then percent-encode each byte, as recommended by the RFC. Therefore, prefer these two functions over escape whenever possible.
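The difference is easy to see by running all three functions on the same Unicode string (escape is deprecated, but still available as a global in browsers and Node.js):

```javascript
// escape uses the deprecated %uXXXX form for non-ASCII characters
// ("中" is U+4E2D, "文" is U+6587);
// encodeURI and encodeURIComponent use UTF-8 percent-encoding instead.
console.log(escape("中文"));             // "%u4E2D%u6587"
console.log(encodeURI("中文"));          // "%E4%B8%AD%E6%96%87"
console.log(encodeURIComponent("中文")); // "%E4%B8%AD%E6%96%87"
```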

Different use cases

EncodeURI is used to encode a complete URI, and encodeURIComponent is used to encode a component of the URI.

From the table of safe characters above, we can see that encodeURIComponent encodes a larger range of characters than encodeURI. As mentioned earlier, reserved characters are typically used to separate URI components (a URI can be split into multiple components, see the preliminary section) or sub-components (such as the delimiters of query parameters); for example, : separates the scheme from the host, and ? separates the path from the query. Since encodeURI operates on a full URI, where these characters have special purposes, it does not encode the reserved characters; otherwise their meaning would change.

Components have their own internal data formats, but that data must not contain the reserved characters that separate components, or the overall URI would be split incorrectly. That is why encodeURIComponent, which encodes a single component, must encode more characters.
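A concrete comparison, using a query value that happens to contain the reserved characters & and = (the example URL is hypothetical):

```javascript
// A query value containing the reserved characters & and =
const value = "a&b=c";

// encodeURIComponent escapes reserved characters, so the value can be
// embedded safely as a single query parameter.
const safeUrl = "http://example.com/s?q=" + encodeURIComponent(value);
console.log(safeUrl); // "http://example.com/s?q=a%26b%3Dc"

// encodeURI leaves reserved characters alone, because in a full URI
// they are meaningful separators; it would not protect this value.
console.log(encodeURI("http://example.com/s?q=" + value));
// "http://example.com/s?q=a&b=c"  (ambiguous: looks like two parameters)
```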

Form submission

When an HTML form is submitted, each field is URL-encoded before it is sent. For historical reasons, the URL-encoding implementation used by forms does not conform to the latest standard: for example, a space is encoded not as %20 but as +, and if the form is submitted with POST, the HTTP request carries a Content-Type header of application/x-www-form-urlencoded. Most server-side applications can handle this non-standard URL encoding, but in client-side JavaScript there is no built-in function that decodes the + back into a space, so you have to write your own conversion. Also, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, suppose we add the following to the HTML header:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

The browser will then render the document using GB2312 (note that if no such meta tag is set, the browser chooses a character set based on the user's preferences, and the user can also force a specific character set for the current site). When the form is submitted, the character set used for URL encoding is then GB2312.
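As noted above, client-side JavaScript has no built-in function that turns the form-encoded + back into a space. A minimal sketch of such a conversion (the helper name decodeFormComponent is hypothetical):

```javascript
// Hypothetical helper: decode an application/x-www-form-urlencoded
// value, where "+" represents a space and %XX is percent-encoding.
function decodeFormComponent(s) {
  // Replace "+" with a space first, then undo the percent-encoding.
  return decodeURIComponent(s.replace(/\+/g, " "));
}

console.log(decodeFormComponent("a+b"));   // "a b"
console.log(decodeFormComponent("a%2Bb")); // "a+b" (a literal plus sign)
```

The order matters: replacing + before calling decodeURIComponent ensures that a literal plus sign, which the form encodes as %2B, is not mistaken for a space.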

Does the document character set affect encodeURI?

One of the most confusing issues I ran into while using Aptana: calling encodeURI did not return the result I expected. Here is my sample code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
</head>
<body>
<script type="text/javascript">
document.write(encodeURI("中文"));
</script>
</body>
</html>

The result is %E6%B6%93%EE%85%9F%E6%9E%83. Clearly this is not a UTF-8 URL encoding of "中文" (search for "中文" on Google and the URL shows %E4%B8%AD%E6%96%87).

So I wondered whether encodeURI was affected by the page encoding, but I found that you would not normally get this result even with GB2312 URL encoding. I finally discovered that the character set used to store the page file was inconsistent with the one declared in the meta tag: Aptana's editor uses UTF-8 by default, so the file was actually saved as UTF-8, while the meta tag told the browser to parse the document as GB2312. The UTF-8 encoding of the string "中文" is the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87; when the browser decodes these six bytes as GB2312, it gets three entirely different characters, and passing those three characters to encodeURI produces %E6%B6%93%EE%85%9F%E6%9E%83. So encodeURI always uses UTF-8 and is not affected by the page character set.

Other issues related to Url encoding

Different browsers behave differently when handling URLs containing Chinese characters. For example, in Internet Explorer, if the advanced setting "Always send URLs as UTF-8" is enabled, the Chinese part of the path is sent to the server URL-encoded as UTF-8, while the Chinese part of the query parameters is URL-encoded using the system's default character set. For maximum interoperability, it is best for all components placed in a URL to be explicitly URL-encoded with a known character set, rather than relying on the browser's default behavior.

In addition, many HTTP monitoring tools and browser address bars automatically decode URLs (using UTF-8) when displaying them, which is why, when you search for Chinese text on Google in Firefox, the address bar shows the Chinese characters. The raw URL sent to the server, however, is still encoded, as you can verify by reading location.href with JavaScript. Don't be misled by these displays when investigating URL encoding and decoding.

References

  • Wikipedia: Percent-encoding
  • CSDN:blog.csdn.net/libertine19…
  • Sina blog: blog.sina.com.cn/s/blog_7299…