First, falling into the pit

Two days ago I ran into a problem in our project: a message was encrypted, sent, and decrypted on the Web end, and the result came out garbled. The strange part was that everything had worked at first. The mobile side had not changed its code, the backend had not changed its code, and the front end said it had not changed anything either, yet this supernatural bug would not go away. After stepping through with breakpoints, I found that the encryption/decryption logic on the mobile end and the Web end was completely consistent. The real problem was the encoding or, strictly speaking, the escaping of / and other special symbols, so I took a closer look at how that encoding works.

Network standard RFC 1738 stipulates:

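“… Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.”

In other words, only letters and digits, the punctuation $-_.+!*'(), (the surrounding double quotes are not part of the set), and reserved characters used in their reserved role may appear unencoded in a URL.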
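As a rough illustration of the rule above, the following JavaScript sketch classifies a single character as safe or unsafe under RFC 1738. The function name `isUnreservedRfc1738` is made up for this example, and reserved characters (which are only safe in their reserved role) are deliberately treated as unsafe.

```javascript
// Hypothetical helper: true when `ch` may appear unencoded in a URL under RFC 1738.
// Reserved characters such as / ? : are treated as unsafe here, because they are
// only allowed unencoded when used for their reserved purpose.
function isUnreservedRfc1738(ch) {
  return /^[0-9a-zA-Z$\-_.+!*'(),]$/.test(ch);
}

console.log(isUnreservedRfc1738("a")); // true
console.log(isUnreservedRfc1738("*")); // true
console.log(isUnreservedRfc1738("/")); // false
console.log(isUnreservedRfc1738(" ")); // false
```

Anything that fails this check inside a data value, such as the / in a ciphertext, has to be percent-encoded before it goes into a URL.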

Second, filling the pit

After finding the bug, I started thinking about how to encode the ciphertext. At first I used URLEncoder.encode("****", "UTF-8"), but I found that the ciphertext after encoding was exactly the same as the ciphertext before encoding. Discussing this with a colleague, we guessed that the JavaScript side was using a different encoding. So I started exploring the encoding method used on the Web side: escape.

The escape function

JavaScript escape() function

This method does not encode ASCII letters and digits, or the following ASCII punctuation marks: * @ - _ + . /. All other characters are replaced by escape sequences.

The escape() encoded string can be decoded using unescape().
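A few calls make this definition concrete. This is only a small demonstration using the built-in escape() and unescape(), which are deprecated but still available in browsers and Node.js; the sample strings are my own.

```javascript
// ASCII letters, digits and @ * _ + - . / pass through unchanged.
console.log(escape("abc123@*_+-./")); // abc123@*_+-./

// Other code units below 256 become %XX ...
console.log(escape("a b")); // a%20b
console.log(escape("¥"));   // %A5

// ... and everything else becomes %uXXXX.
console.log(escape("中"));  // %u4E2D

// unescape() reverses both forms.
console.log(unescape("a%20b%u4E2D")); // a b中
```

Note that a space becomes %20 and a literal + is left alone, which matters when the other side treats + as a space.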

While searching online for a Java version of the escape function, I came across this idea:

  • Digits and upper- and lower-case letters are left unchanged
  • For each non-Chinese special symbol (character code less than 256), prepend % and write the character code in hexadecimal
  • For each Chinese character, prepend %u and write its Unicode code in hexadecimal. For example, “中” becomes %u4E2D

This idea contradicts the definition quoted above, because it would also encode the punctuation marks that escape() is supposed to leave alone, and it failed my test.

Looking further, I found a Java version of an Escape utility class that is very popular online. Following the first rule of laziness, I copied the code into the project and tested whether it worked. Just as I was about to wrap up, I found a problem. To keep up the suspense for a moment, let's first see how it differs from the idea above:

  • Digits, upper- and lower-case letters, and the special characters in the array ['-', '_', '.', '.', '-', '*', '/', '(', ')'] are left unchanged
  • A space is converted to +
  • For each other non-Chinese special symbol (character code less than 128), prepend % and write the character code in hexadecimal. For example, ¥ becomes %A5
  • For each Chinese character, prepend %u and write its Unicode code in hexadecimal. For example, “中” becomes %u4E2D

The definition lists only 7 special symbols and says nothing about converting a space to +, so I ran a quick test and found that this utility is still wrong, because the JavaScript function encodes a space as %20:

escape(" ")  // Result: "%20"

So I rewrote it once more, this time following the definition of the JavaScript escape() function strictly, which produced the following code:

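// The seven punctuation marks that escape() leaves unencoded
// (this definition was missing from the original snippet).
private val specialSymbols = "@*_+-./"

private fun escape(src: String): String {
    val tmp = StringBuilder(src.length * 6)
    for (ch in src) {
        val code = ch.code
        when {
            // ASCII letters, digits and the seven special symbols pass through unchanged
            ch in 'a'..'z' || ch in 'A'..'Z' || ch in '0'..'9'
                    || ch in specialSymbols -> tmp.append(ch)
            // Code units below 256 become %XX (two uppercase hex digits)
            code < 256 -> tmp.append('%')
                .append(code.toString(16).uppercase().padStart(2, '0'))
            // Everything else becomes %uXXXX (four uppercase hex digits)
            else -> tmp.append("%u")
                .append(code.toString(16).uppercase().padStart(4, '0'))
        }
    }
    return tmp.toString()
}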

Since there is an encoder, there has to be a matching decoder, and it is just as simple:

// Assumes well-formed input as produced by escape(); a truncated sequence will throw.
fun unescape(src: String): String {
    val tmp = StringBuilder(src.length)
    var lastPos = 0
    while (lastPos < src.length) {
        val pos = src.indexOf('%', lastPos)
        when {
            // No more escape sequences: copy the remainder and stop
            pos == -1 -> {
                tmp.append(src, lastPos, src.length)
                lastPos = src.length
            }
            // Plain text before the next '%': copy it through unchanged
            pos > lastPos -> {
                tmp.append(src, lastPos, pos)
                lastPos = pos
            }
            // %uXXXX: four hex digits holding one UTF-16 code unit
            src.startsWith("%u", pos) -> {
                tmp.append(src.substring(pos + 2, pos + 6).toInt(16).toChar())
                lastPos = pos + 6
            }
            // %XX: two hex digits holding a code unit below 256
            else -> {
                tmp.append(src.substring(pos + 1, pos + 3).toInt(16).toChar())
                lastPos = pos + 3
            }
        }
    }
    return tmp.toString()
}
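One way to check a hand-written port like the two functions above is to compare it with the built-in on the JavaScript side. The sketch below re-implements escape() in JavaScript from the rules in Annex B of the ECMAScript specification (unescaped set, %XX below 256, %uXXXX otherwise) and compares it with the real function. The names `UNESCAPED` and `escapeSketch` are hypothetical, and this is a comparison aid, not the engine's actual source.

```javascript
// Characters that Annex B of the ECMAScript specification leaves unescaped.
const UNESCAPED =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./";

// Hypothetical re-implementation of escape(), written from the specification text.
function escapeSketch(src) {
  let out = "";
  for (let i = 0; i < src.length; i++) {
    const unit = src.charCodeAt(i); // one UTF-16 code unit at a time
    if (UNESCAPED.includes(src[i])) {
      out += src[i];
    } else if (unit < 256) {
      out += "%" + unit.toString(16).toUpperCase().padStart(2, "0");
    } else {
      out += "%u" + unit.toString(16).toUpperCase().padStart(4, "0");
    }
  }
  return out;
}

// Compare with the built-in for a few inputs, including a surrogate pair.
for (const sample of ["a b", "@*_+-./", "¥", "中", "😀", "100%"]) {
  console.log(JSON.stringify(sample), escapeSketch(sample) === escape(sample));
}
```

Feeding the same samples to the mobile-side encoder and comparing the strings is a quick way to spot differences such as lowercase hex digits or missing zero padding.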

Third, reflection

By rights I could have slacked off at this point, but one question kept nagging at me: why is there no unified method for encoding and decoding? Is there really a discrepancy between Java and JavaScript encoding? So I opened the source code of URLEncoder.encode:

/**
 * Translates a string into {@code application/x-www-form-urlencoded}
 * format using a specific encoding scheme. This method uses the
 * supplied encoding scheme to obtain the bytes for unsafe
 * characters.
 * <p>
 * <em><strong>Note:</strong> The <a href=
 * "http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars">
 * World Wide Web Consortium Recommendation</a> states that
 * UTF-8 should be used. Not doing so may introduce
 * incompatibilities.</em>
 *
 * @param s {@code String} to be translated.
 * @param enc The name of a supported
 *    <a href="../lang/package-summary.html#charenc">character
 *    encoding</a>.
 * @return the translated {@code String}.
 * @exception UnsupportedEncodingException
 *            If the named encoding is not supported
 * @see URLDecoder#decode(java.lang.String, java.lang.String)
 * @since 1.4
 */
public static String encode(String s, String enc)
    throws UnsupportedEncodingException {

    boolean needToChange = false;
    StringBuffer out = new StringBuffer(s.length());
    Charset charset;
    CharArrayWriter charArrayWriter = new CharArrayWriter();

    if (enc == null)
        throw new NullPointerException("charsetName");

    try {
        charset = Charset.forName(enc);
    } catch (IllegalCharsetNameException e) {
        throw new UnsupportedEncodingException(enc);
    } catch (UnsupportedCharsetException e) {
        throw new UnsupportedEncodingException(enc);
    }

    for (int i = 0; i < s.length();) {
        int c = (int) s.charAt(i);
        //System.out.println("Examining character: " + c);
        if (dontNeedEncoding.get(c)) {
            if (c == ' ') {
                c = '+';
                needToChange = true;
            }
            //System.out.println("Storing: " + c);
            out.append((char)c);
            i++;
        } else {
            // convert to external encoding before hex conversion
            do {
                charArrayWriter.write(c);
                /*
                 * If this character represents the start of a Unicode
                 * surrogate pair, then pass in two characters. It's not
                 * clear what should be done if a bytes reserved in the
                 * surrogate pairs range occurs outside of a legal
                 * surrogate pair. For now, just treat it as if it were
                 * any other character.
                 */
                if (c >= 0xD800 && c <= 0xDBFF) {
                    /* System.out.println(Integer.toHexString(c) + " is high surrogate"); */
                    if ( (i+1) < s.length()) {
                        int d = (int) s.charAt(i+1);
                        /* System.out.println("\tExamining " + Integer.toHexString(d)); */
                        if (d >= 0xDC00 && d <= 0xDFFF) {
                            /* System.out.println("\t" + Integer.toHexString(d) + " is low surrogate"); */
                            charArrayWriter.write(d);
                            i++;
                        }
                    }
                }
                i++;
            } while (i < s.length() && !dontNeedEncoding.get((c = (int) s.charAt(i))));

            charArrayWriter.flush();
            String str = new String(charArrayWriter.toCharArray());
            byte[] ba = str.getBytes(charset);
            for (int j = 0; j < ba.length; j++) {
                out.append('%');
                char ch = Character.forDigit((ba[j] >> 4) & 0xF, 16);
                // converting to use uppercase letter as part of
                // the hex value if ch is a letter.
                if (Character.isLetter(ch)) {
                    ch -= caseDiff;
                }
                out.append(ch);
                ch = Character.forDigit(ba[j] & 0xF, 16);
                if (Character.isLetter(ch)) {
                    ch -= caseDiff;
                }
                out.append(ch);
            }
            charArrayWriter.reset();
            needToChange = true;
        }
    }

    return (needToChange? out.toString() : s);
}

Reading this, I started to wonder. The class comment is very clear:

Utility class for HTML form encoding. This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format. For more information about HTML form encoding, consult the HTML specification (www.w3.org/TR/html4/).

This method exists precisely for Web-side encoding, so why does the Web end still show garbled text? The custom encoders circulating online are also based on this method, with extra handling for spaces and + (PS: if they handle those directly anyway, why not just use the official method?). So what is the real reason this method does not work well here? Looking again at the documentation of the JavaScript escape() function, I noticed that ECMAScript v3 deprecates escape() and unescape(), and that applications should use encodeURI()/encodeURIComponent() for encoding and decodeURI()/decodeURIComponent() for decoding instead.
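To make the relationship between the two worlds concrete, here is a minimal JavaScript sketch of a pair of helpers that aim to match java.net.URLEncoder with UTF-8. The names `formEncode` and `formDecode` are hypothetical. The sketch relies on two facts: URLEncoder leaves only letters, digits and . - * _ unencoded and writes a space as +, while encodeURIComponent() additionally leaves ! ~ ' ( ) unencoded and writes a space as %20.

```javascript
// Hypothetical helper: encode the way URLEncoder.encode(s, "UTF-8") does.
function formEncode(s) {
  return encodeURIComponent(s)
    // encodeURIComponent leaves ! ' ( ) ~ alone, URLEncoder percent-encodes them
    .replace(/[!'()~]/g, (c) => "%" + c.charCodeAt(0).toString(16).toUpperCase())
    // application/x-www-form-urlencoded writes a space as +
    .replace(/%20/g, "+");
}

// Hypothetical helper: decode a string produced by URLEncoder.
// Every + must become a space before percent-decoding; a literal plus sign
// arrives as %2B, so it is not affected by the replacement.
function formDecode(s) {
  return decodeURIComponent(s.replace(/\+/g, "%20"));
}

console.log(formEncode("2021-04-09 17:22:15"));     // 2021-04-09+17%3A22%3A15
console.log(formDecode("2021-04-09+17%3A22%3A15")); // 2021-04-09 17:22:15
console.log(formDecode("%40%2B"));                  // @+
```

With helpers like these on the Web side, the mobile side can keep using the official URLEncoder and URLDecoder instead of a hand-written escape port.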

So is the answer that escape and unescape are deprecated and decodeURI should be used instead? Let's test it:

URLEncoder.encode(ciphertext, "utf-8")  // ciphertext: the raw encrypted message, which contains unprintable characters and is not reproduced here
// %C8%8B%C2%9D%C2%8F%C3%A0%C3%9A%C2%89%5C%C2%B5%C2%9De%C2%8B%C2%89%C2%95%C2%A8%C2%AD%C2%A9%C2%95g%5C%5CRRNN%C2%85%C3%8B%C3%89%C3%95%C3%88%C3%8D%C3%A9%C3%95%C2%87%5Cj%5CN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C2%96%5C%5C%E6%B5%AD%EF%A4%A0%3F%EB%81%90%E6%82%AFk%EF%BE%89%EF%B9%A6%EF%BC%B2NI%E2%81%8B%E4%81%8C%E1%BC%B2%EF%B8%A7%EF%BD%8A%E2%81%8B%E4%80%B9%E1%BC%A6%EF%BC%ABNN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C3%88%C3%8D%C3%A9%C3%95%C2%87%5C%5Cv%C2%99%C2%9D%C2%ACvNN%C2%8B%C3%9C%C3%86%C2%B8%C3%93%C3%92%C2%86%5C%C2%AE%C3%A6%C3%A7%C3%9A%C2%91N%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%AE%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%B3%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E9%82%B5%EE%B1%89%EB%8C%93%EB%8F%8E%EA%A9%BB%E4%B8%ACNN%C2%95%C3%98%C3%93%C3%92%C2%AD%C2%AD%C2%86%5C%5Cl%C2%8D%C2%8D%C2%8B%C2%91%C2%A0%C2%AF%C3%85%C2%9Dlhfhe%C2%95%C2%9A%C2%98%C2%99geihd%C2%92%C2%92ijgmh%C2%92%C2%9A%C2%9B%C2%9Ap%C2%9C%C3%84%C2%92gXNN%C2%95%C3%98%C3%93%C3%92%C2%B2%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E8%84%83%EE%B9%98%CA%82%E9%94%ADNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%87%5C%5CTbbc%5E%5Dda%5DiYQhqldlkfWNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%B8%C3%87%C3%95%C3%8E%C3%9D%C2%92%5Ckgghpofadheeg%5EN%C2%95%C3%98%C3%98%C3%A6%C3%9C%C3%98%C3%9D%C2%B7%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%95%C3%A7%C3%95%C3%95%C3%A9%C3%A8%C2%95%5C%5Ct%C2%97%C2%86%C2%85f%C2%9F%C2%A9N%C2%85%C3%90%C3%91%C2%86%5C%5C%C2%85%C3%8B%C3%89%C3%95%C3%93%C2%A2%C2%AB%C3%89%C3%95%C3%81%C3%80%C3%9A%C2%89%C2%9F

decodeURI("%C8%8B%C2%9D%C2%8F%C3%A0%C3%9A%C2%89%5C%C2%B5%C2%9De%C2%8B%C2%89%C2%95%C2%A8%C2%AD%C2%A9%C2%95g%5C%5CRRNN%C2%85%C3%8B%C3%89%C3%95%C3%88%C3%8D%C3%A9%C3%95%C2%87%5Cj%5CN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C2%96%5C%5C%E6%B5%AD%EF%A4%A0%3F%EB%81%90%E6%82%AFk%EF%BE%89%EF%B9%A6%EF%BC%B2NI%E2%81%8B%E4%81%8C%E1%BC%B2%EF%B8%A7%EF%BD%8A%E2%81%8B%E4%80%B9%E1%BC%A6%EF%BC%ABNN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C3%88%C3%8D%C3%A9%C3%95%C2%87%5C%5Cv%C2%99%C2%9D%C2%ACvNN%C2%8B%C3%9C%C3%86%C2%B8%C3%93%C3%92%C2%86%5C%C2%AE%C3%A6%C3%A7%C3%9A%C2%91N%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%AE%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%B3%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E9%82%B5%EE%B1%89%EB%8C%93%EB%8F%8E%EA%A9%BB%E4%B8%ACNN%C2%95%C3%98%C3%93%C3%92%C2%AD%C2%AD%C2%86%5C%5Cl%C2%8D%C2%8D%C2%8B%C2%91%C2%A0%C2%AF%C3%85%C2%9Dlhfhe%C2%95%C2%9A%C2%98%C2%99geihd%C2%92%C2%92ijgmh%C2%92%C2%9A%C2%9B%C2%9Ap%C2%9C%C3%84%C2%92gXNN%C2%95%C3%98%C3%93%C3%92%C2%B2%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E8%84%83%EE%B9%98%CA%82%E9%94%ADNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%87%5C%5CTbbc%5E%5Dda%5DiYQhqldlkfWNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%B8%C3%87%C3%95%C3%8E%C3%9D%C2%92%5Ckgghpofadheeg%5EN%C2%95%C3%98%C3%98%C3%A6%C3%9C%C3%98%C3%9D%C2%B7%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%95%C3%A7%C3%95%C3%95%C3%A9%C3%A8%C2%95%5C%5Ct%C2%97%C2%86%C2%85f%C2%9F%C2%A9N%C2%85%C3%90%C3%91%C2%86%5C%5C%C2%85%C3%8B%C3%89%C3%95%C3%93%C2%A2%C2%AB%C3%89%C3%95%C3%81%C3%80%C3%9A%C2%89%C2%9F")
// Result: the ciphertext comes back (unprintable, not reproduced here), except that the single %3F sequence stays encoded

The two sides match here, so let's look at some other data:

URLEncoder.encode("{\"msg\":{\"CHATTYPE\":\"0\",\"chatType\":0,\"content\":\"测试信息@+～（*$%……，；/“”）\",\"contentType\":\"TEXT\",\"isSend\":true,\"receiveId\":\"9527\",\"receiveName\":\"武当山\",\"sendId\":\"10086\",\"sendName\":\"张三丰\",\"sendTime\":\"2021-04-09 17:22:15\",\"sendTimeStamp\":1617960135052,\"sessionId\":\"10010\",\"status\":\"READ\"},\"cmd\":\"chat_ChatMsg\"}", "utf-8")
// %7B%22msg%22%3A%7B%22CHATTYPE%22%3A%220%22%2C%22chatType%22%3A0%2C%22content%22%3A%22%E6%B5%8B%E8%AF%95%E4%BF%A1%E6%81%AF%40%2B%EF%BD%9E%EF%BC%88*%24%25%E2%80%A6%E2%80%A6%EF%BC%8C%EF%BC%9B%2F%E2%80%9C%E2%80%9D%EF%BC%89%22%2C%22contentType%22%3A%22TEXT%22%2C%22isSend%22%3Atrue%2C%22receiveId%22%3A%229527%22%2C%22receiveName%22%3A%22%E6%AD%A6%E5%BD%93%E5%B1%B1%22%2C%22sendId%22%3A%2210086%22%2C%22sendName%22%3A%22%E5%BC%A0%E4%B8%89%E4%B8%B0%22%2C%22sendTime%22%3A%222021-04-09+17%3A22%3A15%22%2C%22sendTimeStamp%22%3A1617960135052%2C%22sessionId%22%3A%2210010%22%2C%22status%22%3A%22READ%22%7D%2C%22cmd%22%3A%22chat_ChatMsg%22%7D

decodeURI("%7B%22msg%22%3A%7B%22CHATTYPE%22%3A%220%22%2C%22chatType%22%3A0%2C%22content%22%3A%22%E6%B5%8B%E8%AF%95%E4%BF%A1%E6%81%AF%40%2B%EF%BD%9E%EF%BC%88*%24%25%E2%80%A6%E2%80%A6%EF%BC%8C%EF%BC%9B%2F%E2%80%9C%E2%80%9D%EF%BC%89%22%2C%22contentType%22%3A%22TEXT%22%2C%22isSend%22%3Atrue%2C%22receiveId%22%3A%229527%22%2C%22receiveName%22%3A%22%E6%AD%A6%E5%BD%93%E5%B1%B1%22%2C%22sendId%22%3A%2210086%22%2C%22sendName%22%3A%22%E5%BC%A0%E4%B8%89%E4%B8%B0%22%2C%22sendTime%22%3A%222021-04-09+17%3A22%3A15%22%2C%22sendTimeStamp%22%3A1617960135052%2C%22sessionId%22%3A%2210010%22%2C%22status%22%3A%22READ%22%7D%2C%22cmd%22%3A%22chat_ChatMsg%22%7D")
// {"msg"%3A{"CHATTYPE"%3A"0"%2C"chatType"%3A0%2C"content"%3A"测试信息%40%2B～（*%24%……，；%2F“”）"%2C"contentType"%3A"TEXT"%2C"isSend"%3Atrue%2C"receiveId"%3A"9527"%2C"receiveName"%3A"武当山"%2C"sendId"%3A"10086"%2C"sendName"%3A"张三丰"%2C"sendTime"%3A"2021-04-09+17%3A22%3A15"%2C"sendTimeStamp"%3A1617960135052%2C"sessionId"%3A"10010"%2C"status"%3A"READ"}%2C"cmd"%3A"chat_ChatMsg"}
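The leftover %3A, %2C and %2F sequences in that result are expected behavior: decodeURI() never decodes sequences that stand for the reserved characters ; / ? : @ & = + $ , #. A shorter sample, made up for illustration (the `path` key is not from the real message), shows the difference from decodeURIComponent():

```javascript
// Illustrative payload, shaped like a fragment of URLEncoder output.
const encoded =
  "%7B%22sendTime%22%3A%222021-04-09+17%3A22%3A15%22%2C%22path%22%3A%22a%2Fb%22%7D";

// decodeURI() leaves the escape sequences of reserved characters untouched.
console.log(decodeURI(encoded));
// {"sendTime"%3A"2021-04-09+17%3A22%3A15"%2C"path"%3A"a%2Fb"}

// decodeURIComponent() decodes every sequence, but still does not turn + into a space.
console.log(decodeURIComponent(encoded));
// {"sendTime":"2021-04-09+17:22:15","path":"a/b"}
```

So for data produced by URLEncoder, decodeURIComponent() is the closer match, with the + in the timestamp still needing separate handling.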

These tests show that the garbled text is caused by the Web side using the wrong decoding method: decodeURI() leaves the escape sequences of reserved characters such as %3A, %2C, %40, %2B and %2F undecoded, and it does not turn + back into a space. However, I could not find the source code of the JavaScript escape function implementation, so the solution and encoding method in this article are not guaranteed to be 100% correct.