URIError: URI malformed (500 error)

This error is thrown only when a string containing invalid characters is passed to a function such as encodeURI and cannot be converted.

The user data that triggers the error is as follows:

“The days passed like water. It flows by us quietly, without a sound. Come on in the future, don’t let it become a burden.”

["On the occasion of the Mid-Autumn Festival, we wish you a happy holiday!", "�🎉🎉"]

“The other shore flowers bloom the other shore, heartbroken hastily broken liver bowel 🌺🌺�”

Investigation

At first glance the emoji look like the culprit, but they actually encode perfectly normally:

> encodeURI('🌸')
< '%F0%9F%8C%B8'

Even the garbled character itself encodes without error:

> encodeURI('�')
< '%EF%BF%BD'

Since all of this data encodes normally, the real source of the error must be whatever these garbled characters actually are underneath, so we have to figure out how they were produced.

Looking back at the documented error case for encodeURI, the official example is shown below
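The official example is not reproduced in this text. As a minimal sketch of the documented failure mode (encodeURI throws when it meets a surrogate that is not part of a high-low pair), the following can be run in any modern JavaScript engine. The helper name tryEncode is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: try to encode a string and report whether it threw.
function tryEncode(str) {
  try {
    return { ok: true, value: encodeURI(str) };
  } catch (err) {
    return { ok: false, errorName: err.name };
  }
}

const loneHigh = tryEncode("\uD800");        // high surrogate with no partner
const loneLow = tryEncode("\uDFFF");         // low surrogate with no partner
const fullPair = tryEncode("\uD83C\uDF38");  // complete pair (🌸)

console.log(loneHigh.ok, loneHigh.errorName);
console.log(loneLow.ok, loneLow.errorName);
console.log(fullPair.ok, fullPair.value);
```

Only the complete pair encodes; both lone surrogates fail with a URIError.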

As we all know, Unicode contains both basic and supplementary characters. Most common characters live in the Basic Multilingual Plane (BMP, 65,536 code points) and are represented in UTF-16 with 2 bytes (\uXXXX).

Characters outside the basic plane (in the supplementary planes), such as hieroglyphs (𓀀), cuneiform (𒆠) and emoji (💩), need 4 bytes (\uXXXX\uXXXX). For example: 🌸 => \uD83C\uDF38

In addition, Unicode reserves a special range for encoding supplementary characters in UTF-16. The first two bytes (the high surrogate) range from U+D800 to U+DBFF, and the last two bytes (the low surrogate) range from U+DC00 to U+DFFF. A high or low surrogate that appears on its own is therefore not treated as a valid character: \ud83c => �
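To make the mapping concrete, here is a minimal sketch of the standard UTF-16 arithmetic that turns a supplementary code point into its high and low surrogates. The function name toSurrogatePair is hypothetical; the arithmetic itself is the standard UTF-16 rule.

```javascript
// Hypothetical helper: split a supplementary code point (>= 0x10000)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F338); // U+1F338 is 🌸
const rebuilt = String.fromCharCode(high, low);

console.log(high.toString(16), low.toString(16)); // d83c df38
console.log(rebuilt === "🌸");                    // true
```

Each half falls inside the reserved range, which is why neither half means anything when it appears alone.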

With this in mind, the truth starts to surface: the garbled characters in the logged user data that triggered the error are emoji with half of their surrogate pair missing.

But why would a user enter only half of an emoji? It is clearly unlikely for that to happen by accident so often.

Getting to the bottom of it

As mentioned earlier, basic-plane characters are represented by 2 bytes in UTF-16, and supplementary characters by 4 bytes.

However, JavaScript was born a year before UTF-16 was released, so it could only use the now-obsolete UCS-2 encoding. As a result, the language assumes every character is 2 bytes long, and a supplementary-plane character is simply treated as 2 characters, so the following happens:

> "\ud83c\udf38"
< "🌸"
> "🌸".length
< 2

At this point I looked back at the log data and found that every problematic string had a length of exactly 20, 50 or some other round multiple of ten. I then went back to the relevant business code and, sure enough, every one of these user inputs had a length limit.

The answer became clear. Suppose a text input limits the user to 100 characters. After typing 99 characters, the user enters an emoji as the 100th. The JS truncation then cuts the emoji in half: the user sees that the word count is full and that the last emoji is not displayed, but the emoji's high surrogate is still in the string. When the user clicks submit, the error is thrown.
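The scenario can be reproduced in a few lines. This is a minimal sketch that assumes the truncation is a plain slice on UTF-16 code units; the limit of 10 and the variable names are hypothetical and chosen only to keep the example short.

```javascript
const MAX_LENGTH = 10;        // hypothetical limit, chosen only for this sketch
const input = "123456789🌸";  // 9 ASCII characters followed by one emoji

// slice counts UTF-16 code units, so the emoji (2 units) is cut in half.
const truncated = input.slice(0, MAX_LENGTH);
const lastUnit = truncated.charCodeAt(truncated.length - 1);

let caught = null;
try {
  encodeURI(truncated);
} catch (err) {
  caught = err; // the lone high surrogate makes encodeURI throw
}

console.log(input.length);          // 11
console.log(lastUnit.toString(16)); // d83c
console.log(caught instanceof URIError);
```

The truncated string has exactly the "round" length seen in the logs, and its last code unit is the orphaned high surrogate.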

As shown in the GIF below (note the change in word count)

The solution

Limiting the length of user input is so common in our business code that fixing the string truncation everywhere it occurs is impractical. Instead, we chose to filter out incomplete encodings in one place, right before calling encodeURI.

As mentioned earlier, a supplementary-plane character is encoded with its first two bytes in [\uD800-\uDBFF] and its last two bytes in [\uDC00-\uDFFF].

So there are four cases that the regular expression needs to cover (without considering high-low inversion):

  • The high surrogate appears alone: [\uD800-\uDBFF][^\uDC00-\uDFFF], [\uD800-\uDBFF]$
  • The low surrogate appears alone: [^\uD800-\uDBFF][\uDC00-\uDFFF], ^[\uDC00-\uDFFF]
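/**
 * Finds the position of the first incomplete character encoding.
 * @param {string} str
 * @returns {number} index of the lone surrogate, or -1 if there is none
 */
function findInvalidUnicode(str) {
  // Groups: 1 = high not followed by a low, 2 = low at the start,
  // 3 = high at the end, 4 = low not preceded by a high.
  const match = /([\uD800-\uDBFF])[^\uDC00-\uDFFF]|^([\uDC00-\uDFFF])|([\uD800-\uDBFF])$|[^\uD800-\uDBFF]([\uDC00-\uDFFF])/.exec(str);
  if (match) {
    // In case 4 the match starts one code unit before the lone low surrogate.
    if (match[4]) {
      return match.index + 1;
    }
    return match.index;
  }
  return -1;
}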
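How the finder is wired in before encodeURI is not shown above, so the following is only a sketch of one possible use: remove each lone surrogate it reports until none remain, then encode. The wrapper name safeEncodeURI is hypothetical, and the finder is repeated in compact form so the block runs on its own.

```javascript
// Compact copy of the finder, repeated so this sketch is self-contained.
function findInvalidUnicode(str) {
  const match = /([\uD800-\uDBFF])[^\uDC00-\uDFFF]|^([\uDC00-\uDFFF])|([\uD800-\uDBFF])$|[^\uD800-\uDBFF]([\uDC00-\uDFFF])/.exec(str);
  if (!match) return -1;
  return match[4] ? match.index + 1 : match.index;
}

// Hypothetical wrapper (an assumption, not the article's code):
// drop every lone surrogate, then encode what is left.
function safeEncodeURI(str) {
  let cleaned = str;
  let index = findInvalidUnicode(cleaned);
  while (index !== -1) {
    cleaned = cleaned.slice(0, index) + cleaned.slice(index + 1);
    index = findInvalidUnicode(cleaned);
  }
  return encodeURI(cleaned);
}

const truncatedInput = "ab\uD83C";             // emoji cut in half by truncation
const mixedInput = "\uDF38x\uD83C\uDF38\uD83Cy"; // lone low, full 🌸, lone high

console.log(safeEncodeURI(truncatedInput)); // ab
console.log(safeEncodeURI(mixedInput));     // x%F0%9F%8C%B8y
```

Removing one code unit per pass keeps the logic simple at the cost of rescanning the string; for the short, length-limited inputs described here that trade-off seems reasonable.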
