Background

Some time ago, a colleague reported that some uploaded files could not be downloaded. Our CDN is Qiniu Cloud and the backend of the affected system runs on Node.js, so I suspected a bug in Qiniu's official Node.js SDK. I first checked the names of the files that could not be downloaded and found that the file names contained the special character “#”. That pointed to URL encoding, and a look at the nodejs-sdk source showed where the problem was:

// https://github.com/qiniu/nodejs-sdk/blob/v7.1.1/qiniu/storage/rs.js#L651
BucketManager.prototype.publicDownloadUrl = function (domain, fileName) {
  return domain + "/" + encodeURI(fileName);
};

encodeURI does not escape the special characters “?” and “#”, so a file name containing either of them produces a broken download link. Take “hello#world.txt”:

publicDownloadUrl('http://example.com', 'hello#world.txt')
// actual   => http://example.com/hello#world.txt
// expected => http://example.com/hello%23world.txt
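To see which characters slip through, here is a small sketch (not taken from the SDK) that compares the two built-in encoders on the characters discussed in this article. The helper name `compare` is only for illustration.

```javascript
// encodeURI treats its input as a complete URI, so it leaves URI delimiters alone.
// encodeURIComponent treats its input as a single component and escapes them.
function compare (ch) {
  return [encodeURI(ch), encodeURIComponent(ch)];
}

for (const ch of ['#', '?', '/', '%', ' ']) {
  const [asUri, asComponent] = compare(ch);
  console.log(JSON.stringify(ch), asUri, asComponent);
}
// "#" # %23
// "?" ? %3F
// "/" / %2F
// "%" %25 %25
// " " %20 %20
```

Both functions escape “%”, but only encodeURIComponent escapes “#”, “?” and “/”.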

Since the problem is that these special characters are not escaped, could we escape them ourselves ahead of time? No: encodeURI also escapes the “%” character, so anything escaped in advance gets encoded a second time.

encodeURI('hello%23world.txt')
// => hello%2523world.txt

Clearly, the only real fix is to send a PR to Qiniu's SDK. But how should it be fixed?

  1. Replace encodeURI with encodeURIComponent: rejected (“/” would be escaped too)
  2. Write an encodePath: rejected (fileName may also carry a query, e.g. “he#llo.jpg?imageView2/2/w/800”)
  3. Change the API to publicDownloadUrl(domain, fileName, query): rejected (too big a change; it would require a major version bump)
  4. Replace encodeURI with another function that avoids double encoding: feasible (small change, minimal impact)
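One way option 4 could look is sketched below. This is an assumption for illustration and not the code from the PR: the name `encodeOnce` is hypothetical, and the idea is simply to leave existing %XX escapes untouched so that a caller can escape “#” or “?” in advance without it being encoded twice.

```javascript
// Hypothetical helper (not the SDK's API): like encodeURI, but skips existing %XX escapes.
function encodeOnce (str) {
  // split() with a capture group keeps the matched escapes at the odd indices.
  return str
    .split(/(%[0-9A-Fa-f]{2})/)
    .map((part, i) => (i % 2 === 1 ? part : encodeURI(part)))
    .join('');
}

// Same signature as the SDK method quoted above, with only the encoder swapped.
function publicDownloadUrl (domain, fileName) {
  return domain + '/' + encodeOnce(fileName);
}

console.log(publicDownloadUrl('http://example.com', 'hello%23world.txt'));
// => http://example.com/hello%23world.txt
console.log(publicDownloadUrl('http://example.com', 'he%23llo.jpg?imageView2/2/w/800'));
// => http://example.com/he%23llo.jpg?imageView2/2/w/800
```

The trade-off of this approach is that a file name containing a literal “%23” becomes ambiguous unless the caller escapes its “%” as “%25” first.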

So we have this PR: Safe encode URL

PS: From an API-friendliness standpoint, option 3 would actually be better, or the SDK could simply not encode at all and leave that to the caller. But either change would have too large an impact.

With the problem solved, is that the end of this article? Not quite: this is only the beginning. Let's really talk about URL encoding.

Specification

So far I have said URL, because we were dealing with one specific resource. Now that we are talking about the specification, the proper term is URI; see Uniform Resource Identifier (RFC 3986).

The following example shows the components defined by the URI specification:

   http://example.com:8042/over/there?name=ferret#nose
   \__/   \______________/\_________/ \_________/ \__/
    |            |            |            |        |
 scheme      authority       path        query   fragment
    |   ______________________|_
   / \ /                        \
   urn:example:animal:ferret:nose

According to the specification, a URI consists of five parts:

1. Scheme

scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

2. Authority (including userinfo, host, and port)

authority   = [ userinfo "@" ] host [ ":" port ]
userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )
host        = IP-literal / IPv4address / reg-name
port        = *DIGIT

3. Path

path          = path-abempty    ; begins with "/" or is empty
                / path-absolute   ; begins with "/" but not "//"
                / path-noscheme   ; begins with a non-colon segment
                / path-rootless   ; begins with a segment
                / path-empty      ; zero characters

4. Query

query       = *( pchar / "/" / "?" )

5. Fragment

fragment    = *( pchar / "/" / "?" )
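To make the five components concrete, here is a minimal sketch that splits a URI reference using the regular expression given in Appendix B of RFC 3986. The function name `splitUri` is an assumption for illustration; the regex only separates the components and does not validate them against the ABNF above.

```javascript
// Regular expression from RFC 3986, Appendix B ("/" escaped for a JS regex literal).
const URI_RE = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: returns the five components; absent ones are undefined.
function splitUri (uri) {
  const m = URI_RE.exec(uri); // every group is optional, so this always matches
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9]
  };
}

console.log(splitUri('http://example.com:8042/over/there?name=ferret#nose'));
// scheme 'http', authority 'example.com:8042', path '/over/there',
// query 'name=ferret', fragment 'nose'

console.log(splitUri('urn:example:animal:ferret:nose'));
// scheme 'urn', path 'example:animal:ferret:nose'; the other parts are undefined
```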

Implement a URI Encode

First, define a type for each part of a URI that can be encoded:

const TYPE = {
  SCHEME: 1,
  AUTHORITY: 2,
  USER_INFO: 3,
  HOST_IPV4: 4,
  HOST_IPV6: 5,
  PORT: 6,
  PATH: 7,
  PATH_SEGMENT: 8,
  QUERY: 9,
  QUERY_PARAM: 10,
  FRAGMENT: 11,
  URI: 12
};

Then implement a few predicate functions for the character classes in the specification (each takes a character code):

function isAlpha (c) {
  return (c >= 65 && c <= 90) || (c >= 97 && c <= 122);
}

function isDigit (c) {
  return (c >= 48 && c <= 57);
}

function isGenericDelimiter (c) {
  // : / ? # [ ] @
  return [58, 47, 63, 35, 91, 93, 64].indexOf(c) >= 0;
}

function isSubDelimiter (c) {
  // ! $ & ' ( ) * + , ; =
  return [33, 36, 38, 39, 40, 41, 42, 43, 44, 59, 61].indexOf(c) >= 0;
}

function isReserved (c) {
  return isGenericDelimiter(c) || isSubDelimiter(c);
}

function isUnreserved (c) {
  // - . _ ~
  return isAlpha(c) || isDigit(c) || [45, 46, 95, 126].indexOf(c) >= 0;
}

function isPchar (c) {
  // : @
  return isUnreserved(c) || isSubDelimiter(c) || c === 58 || c === 64;
}

Next, implement the check for whether a character needs escaping: isAllow(c, type) returns true when the character is allowed as-is in that part of the URI and therefore does not need to be escaped.

function isAllow (c, type) {
  switch (type) {
    case TYPE.SCHEME:
      return isAlpha(c) || isDigit(c) || [43, 45, 46].indexOf(c) >= 0; // + - .
    case TYPE.AUTHORITY:
      return isUnreserved(c) || isSubDelimiter(c) || c === 58 || c === 64; // : @
    case TYPE.USER_INFO:
      return isUnreserved(c) || isSubDelimiter(c) || c === 58; // :
    case TYPE.HOST_IPV4:
      return isUnreserved(c) || isSubDelimiter(c);
    case TYPE.HOST_IPV6:
      return isUnreserved(c) || isSubDelimiter(c) || [91, 93, 58].indexOf(c) >= 0; // [ ] :
    case TYPE.PORT:
      return isDigit(c);
    case TYPE.PATH:
      return isPchar(c) || c === 47; // /
    case TYPE.PATH_SEGMENT:
      return isPchar(c);
    case TYPE.QUERY:
      return isPchar(c) || c === 47 || c === 63; // / ?
    case TYPE.QUERY_PARAM:
      return (c === 61 || c === 38) // = &
        ? false
        : (isPchar(c) || c === 47 || c === 63); // / ?
    case TYPE.FRAGMENT:
      return isPchar(c) || c === 47 || c === 63; // / ?
    case TYPE.URI:
      return isUnreserved(c);
    default:
      return false;
  }
}

Finally, implement the encode function:

// Precomputed percent-encoded form of every ASCII code (0x00-0x7F)
const hexTable = new Array(128);
for (let i = 0; i < 128; ++i) {
  hexTable[i] = '%' + ((i < 16 ? '0' : '') + i.toString(16)).toUpperCase();
}

function encode (str, type = TYPE.URI) {
  if (!str) return str;

  let out = '';
  let last = 0; // start of the pending run of non-ASCII characters
  for (let i = 0; i < str.length; ++i) {
    const c = str.charCodeAt(i);

    // ASCII
    if (c < 0x80) {
      // Flush the non-ASCII run that precedes this character, if any
      if (last < i) {
        out += encodeURIComponent(str.slice(last, i));
      }
      if (isAllow(c, type)) {
        out += str[i];
      } else {
        out += hexTable[c];
      }
      last = i + 1;
    }
  }

  // Flush a trailing non-ASCII run
  if (last < str.length) {
    out += encodeURIComponent(str.slice(last));
  }

  return out;
}

With this simple URI encoder in place, we can wrap it in a few utility functions:

function encodeScheme (scheme) {
  return encode(scheme, TYPE.SCHEME);
}
function encodeAuthority (authority) {
  return encode(authority, TYPE.AUTHORITY);
}
function encodePath (path) {
  return encode(path, TYPE.PATH);
}
function encodeQuery (query) {
  return encode(query, TYPE.QUERY);
}
// ...

Let’s test it out:

encodePath('/foo bar#?.js')
// => /foo%20bar%23%3F.js
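As a cross-check, the same idea can be sketched in a compact, self-contained form with a regular expression per component. This is not how uri-utils is implemented; the names `makeEncoder`, `encodePathSketch`, `encodeQuerySketch` and `downloadUrl` are hypothetical, and the allowed sets are copied from the pchar, path and query rules quoted earlier.

```javascript
// Characters in pchar: unreserved / sub-delims / ":" / "@"
const PCHAR = "A-Za-z0-9\\-._~!$&'()*+,;=:@";

// Build an encoder that percent-encodes every run of characters outside `allowed`.
// encodeURIComponent escapes everything except A-Z a-z 0-9 - _ . ! ~ * ' ( ),
// all of which are in pchar, so every character of a disallowed run gets escaped.
function makeEncoder (allowed) {
  const disallowedRun = new RegExp('[^' + allowed + ']+', 'g');
  return (str) => str.replace(disallowedRun, (run) => encodeURIComponent(run));
}

const encodePathSketch = makeEncoder(PCHAR + '/');   // pchar / "/"
const encodeQuerySketch = makeEncoder(PCHAR + '/?'); // pchar / "/" / "?"

// Hypothetical three-argument variant, in the spirit of option 3 above.
function downloadUrl (domain, fileName, query) {
  const url = domain + '/' + encodePathSketch(fileName);
  return query ? url + '?' + encodeQuerySketch(query) : url;
}

console.log(encodePathSketch('/foo bar#?.js'));
// => /foo%20bar%23%3F.js
console.log(downloadUrl('http://example.com', 'he#llo.jpg', 'imageView2/2/w/800'));
// => http://example.com/he%23llo.jpg?imageView2/2/w/800
```

Encoding the path and the query separately is what makes “#” and “?” inside a file name unambiguous.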

Finally

I could not find an existing library for this on npm, so I published one: uri-utils. The full source is on GitHub: uri-utils, including unit tests and benchmarks.

Benchmark results:

node version: v4.8.7
uri-utils x 150,915 ops/sec ±1.08% (89 runs sampled)
encodeURIComponent x 112,777 ops/sec ±1.29% (73 runs sampled)
Fastest is uri-utils

node version: v6.12.3
uri-utils x 60 ops/sec ±0.55% (75 runs sampled)
encodeURIComponent x 60 ops/sec ±0.55% (77 runs sampled)
Fastest is uri-utils

node version: v8.9.4
uri-utils x 155,020 ops/sec ±5.58% (75 runs sampled)
encodeURIComponent x 612,347 ops/sec ±4.05% (83 runs sampled)
Fastest is encodeURIComponent