Preface

We had never thought carefully about how strings work, so the new string methods in ES6 were something of a puzzle. This article works out why they were added and how to use them. Before that, we need to review two concepts: bytes and number bases.

Bytes

A byte is a unit of measurement for information, independent of the type of data being measured. It is a basic concept in communication and data storage.

One byte stores 8 bits of binary data (this is the convention, and it is worth committing to memory).

1 byte = 8 bits

2^8 is 256, so one byte can take 256 possible values, from 0 to 255.

One hexadecimal digit corresponds to four binary digits, so one byte can be written as two hexadecimal digits.
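As a small illustration of the two statements above, the sketch below prints a byte value as 8 binary digits and as 2 hexadecimal digits. The helper name describeByte is made up for this example.

```javascript
// Hypothetical helper: show one byte as 8 binary digits and 2 hex digits.
function describeByte(value) {
  if (!Number.isInteger(value) || value < 0 || value > 255) {
    throw new RangeError("a byte must be an integer from 0 to 255");
  }
  return {
    binary: value.toString(2).padStart(8, "0"), // always 8 bits
    hex: value.toString(16).padStart(2, "0"),   // always 2 hex digits
  };
}

console.log(describeByte(255)); // { binary: '11111111', hex: 'ff' }
console.log(describeByte(65));  // { binary: '01000001', hex: '41' }
```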

1 KB = 1024 B  = 2^10 bytes
1 MB = 1024 KB = 2^20 bytes
1 GB = 1024 MB = 2^30 bytes

Network bandwidth is quoted in bits: a 200M connection is actually 200 Mb/s. File sizes are measured in bytes, not bits, so the figure needs to be converted:

200 Mb/s / 8 = 25 MB/s
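The same conversion can be written as a tiny function. This is only a sketch: the name megabitsToMegabytes is invented here, and the estimate ignores protocol overhead and the difference between decimal and binary megabytes.

```javascript
// Hypothetical helper: convert a link speed from megabits/s to megabytes/s.
function megabitsToMegabytes(mbps) {
  const BITS_PER_BYTE = 8;
  return mbps / BITS_PER_BYTE;
}

console.log(megabitsToMegabytes(200)); // 25

// Rough time in seconds to download a 1 GB (1024 MB) file at that rate.
console.log(1024 / megabitsToMegabytes(200)); // 40.96
```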

Number bases

Base         Prefix                Example
Binary       0b / 0B               0b11 = 2 + 1 = 3
Octal        0o / 0O (legacy: 0)   0o11 = 8 + 1 = 9
Decimal      none                  11 = 11
Hexadecimal  0x / 0X               0x11 = 16 + 1 = 17

0b10 // binary (2)

0o10 // octal (8)

0xff // hexadecimal (255)

Base conversion

  • parseInt(str, radix): parses the string str as a number written in base radix and returns its value
  • Number.prototype.toString(radix): returns a string representing the number in the given base; radix can be from 2 to 36

parseInt("8", 10)  // ==> 8 (decimal)
parseInt("13", 8)  // ==> 11 (decimal)
(10).toString(2)   // ==> "1010" (binary)
(18).toString(16)  // ==> "12" (hexadecimal)

// Convert num from base m to base n
function transformRadix(num, m, n) {
  var s = num + "";
  var result = parseInt(s, m).toString(n);
  return result;
}

JavaScript character representation

JavaScript has six ways to represent a character: the literal itself plus five escape forms.

'\z' === 'z'      // true
'\172' === 'z'    // true, octal escape (not allowed in strict mode)
'\x7A' === 'z'    // true, hexadecimal escape
'\u007A' === 'z'  // true, Unicode escape
'\u{7A}' === 'z'  // true, ES6 code point escape

Character encoding

ASCII

In ASCII, the letter A is encoded as hexadecimal 0x41, the letter B as 0x42, and so on:

Letter  ASCII code
A       0x41
B       0x42
C       0x43
D       0x44

For details, see the ASCII code table.

"A".charCodeAt(0)             // ==> 65
"A".codePointAt(0)            // ==> 65
String.fromCharCode("0x41")   // ==> "A"
String.fromCharCode(65)       // ==> "A"
String.fromCodePoint(65)      // ==> "A"
String.fromCodePoint("0x41")  // ==> "A"

Unicode

ASCII defines only 128 characters (codes 0 to 127), so encoding more characters requires Unicode. For example, the Unicode code point of the Chinese character 中 is 0x4E2D, and its UTF-8 encoding takes 3 bytes:

Character  Unicode  UTF-8 encoding
中         0x4e2d   0xe4b8ad
文         0x6587   0xe69687
编         0x7f16   0xe7bc96
码         0x7801   0xe7a081

Correct recognition of Unicode encodings

JavaScript allows a character to be written in the form \uxxxx, where xxxx is the character's Unicode code point in hexadecimal. Unicode code points range from U+0000 to U+10FFFF, but this notation only covers code points between \u0000 and \uFFFF. Characters outside that range must be written as two 2-byte units (a surrogate pair). If more than four hexadecimal digits follow \u, as in \u20BB7, JavaScript reads it as \u20BB followed by the character 7.

console.log('\u4e2d\u6587\u7f16\u7801') // 中文编码
console.log("\uD842\uDFB7")             // 𠮷
console.log("\u20BB7")                  // ₻7 (garbled)
// ES6 improves on this: putting the code point in braces reads the character correctly.
console.log("\u{20bb7}")                // 𠮷

JavaScript stores strings in memory as UTF-16, so the basic unit that string operations work on is a 2-byte code unit. A character that needs four bytes of storage (a Unicode code point greater than 0xFFFF) is treated by JavaScript as two characters, which can cause problems.
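To make the "two characters" concrete, here is a minimal sketch of how a code point above 0xFFFF is split into its two UTF-16 code units (a surrogate pair). The function name toSurrogatePair is hypothetical; the arithmetic is the standard UTF-16 rule.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into a surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xffff || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x20bb7);
console.log(high.toString(16), low.toString(16));            // d842 dfb7
console.log(String.fromCharCode(high, low) === "\u{20bb7}"); // true
```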

Method         Description
fromCharCode   Takes the specified Unicode values and returns a string. Valid values are integers from 0 to 65535 (0xFFFF).
charCodeAt     Returns the Unicode encoding of the character at the specified position. The return value is an integer from 0 to 65535, that is, no greater than 0xFFFF.
fromCodePoint  Returns a string created from the specified Unicode code points.
codePointAt    Returns the code point of the character that starts at the given index in the string.

charAt vs charCodeAt vs codePointAt

var s = "𠮷";
s.length         // 2
s.charAt(0)      // '\uD842' (a lone surrogate, not a displayable character)
s.charAt(1)      // '\uDFB7' (a lone surrogate, not a displayable character)
s.charCodeAt(0)  // 55362 ==> (55362).toString(16) ==> "d842"
s.charCodeAt(1)  // 57271 ==> (57271).toString(16) ==> "dfb7"

// Characters beyond 0xFFFF must be represented as two 2-byte units.
console.log("\uD842\uDFB7")   // 𠮷

// ES6
"𠮷".codePointAt(0)           // 134071 ==> (134071).toString(16) ==> "20bb7"
String.fromCodePoint(134071)  // 𠮷
"\u{20bb7}"                   // 𠮷

In the code above, the Chinese character "𠮷" (note that this is not the common character 吉) has the code point 0x20BB7. Its UTF-16 encoding is 0xD842 0xDFB7 (55362 57271 in decimal), which takes four bytes of storage. JavaScript does not handle such 4-byte characters correctly: it reports the string length as 2, charAt cannot read the whole character, and charCodeAt can only return the values of the first two bytes and the last two bytes separately.

ES6 provides the codePointAt method, which correctly handles characters stored in 4 bytes and returns the character's code point.
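One simple use of codePointAt is testing whether a character occupies two bytes or four. The sketch below assumes a helper named is32Bit, which is not part of the language.

```javascript
// Hypothetical helper: true when the character starting at `index`
// needs two UTF-16 code units (four bytes).
function is32Bit(str, index = 0) {
  return str.codePointAt(index) > 0xffff;
}

console.log(is32Bit("\u{20bb7}")); // true  (𠮷)
console.log(is32Bit("a"));         // false
```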

for...of

var text = String.fromCodePoint(0x20BB7);

for(let i = 0; i < text.length; i++){
   console.log(text[i]);
}
// �
// �

for(let i of text){
  console.log(i);
}
// "𠮷"
// String.fromCodePoint(0x20BB7).codePointAt(0).toString(16) // ==> 20BB7

Besides iterating over strings, the main advantage of this iterator is that it recognizes code points larger than 0xFFFF, which a traditional for loop cannot.
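Because the spread operator and Array.from use the same string iterator as for...of, they can be used to count or reverse characters by code point. The helper names below are made up for illustration, and the sketch does not handle combining marks or emoji sequences.

```javascript
// Hypothetical helpers built on the string iterator.
function codePointLength(str) {
  return Array.from(str).length;
}

function reverseByCodePoint(str) {
  return [...str].reverse().join("");
}

const sample = "a\u{20bb7}b"; // "a𠮷b"
console.log(sample.length);                             // 4
console.log(codePointLength(sample));                   // 3
console.log(reverseByCodePoint(sample) === "b\u{20bb7}a"); // true
// A naive split("") reverses the two surrogates and breaks the character.
console.log(sample.split("").reverse().join("") === "b\u{20bb7}a"); // false
```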

Related methods: String.fromCharCode(), String.prototype.charAt(), String.prototype.charCodeAt(), String.fromCodePoint()

UTF-8 vs UTF-16

UTF-16 stores each character in one or two 2-byte units: characters up to U+FFFF take two bytes, and characters above that take four bytes.

UTF-8 is a variable-length encoding: a character takes 1 to 4 bytes, depending on the symbol. A character in the ASCII range takes one byte, identical to its one-byte ASCII encoding. A decoder reads byte by byte and uses the leading bits of a byte to decide whether one, two, three, or four bytes form a unit. UTF-8 follows these rules:

  • 0xxxxxxx: if the byte starts with 0 (x stands for any bit), that single byte is a unit, exactly as in ASCII.
  • 110xxxxx 10xxxxxx: in this format, the two bytes are treated as a unit.
  • 1110xxxx 10xxxxxx 10xxxxxx: in this format, the three bytes are treated as a unit.
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: in this format, the four bytes are treated as a unit.
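The byte patterns above can be turned into a small encoder. This is a sketch for illustration only: utf8Encode is a made-up name, the function does not reject surrogate code points, and real code would normally use the built-in TextEncoder, which is used here as a cross-check.

```javascript
// Hypothetical helper: encode one code point into UTF-8 bytes by hand.
function utf8Encode(codePoint) {
  if (codePoint < 0x80) {
    return [codePoint]; // 0xxxxxxx
  }
  if (codePoint < 0x800) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (codePoint < 0x10000) {
    return [
      0xe0 | (codePoint >> 12), // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  return [
    0xf0 | (codePoint >> 18), // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

const toHex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(toHex(utf8Encode(0x41)));    // 41
console.log(toHex(utf8Encode(0x4e2d)));  // e4 b8 ad
console.log(toHex(utf8Encode(0x20bb7))); // f0 a0 ae b7

// Cross-check against the built-in encoder ("\u4e2d" is 中).
console.log(toHex(Array.from(new TextEncoder().encode("\u4e2d")))); // e4 b8 ad
```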

To tell which encoding a file uses, check the marker at the start of the text (the byte order mark, or BOM), when one is present. The markers for the common encodings are:

Leading bytes  Encoding
EF BB BF       UTF-8
FE FF          UTF-16/UCS-2, big endian
FF FE          UTF-16/UCS-2, little endian
FF FE 00 00    UTF-32/UCS-4, little endian
00 00 FE FF    UTF-32/UCS-4, big endian
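A rough sketch of BOM sniffing follows. The function name detectBom and its return labels are assumptions made for this example. FE FF is the big-endian UTF-16 mark and FF FE the little-endian one; because the UTF-32 little-endian mark also starts with FF FE, the 4-byte marks are checked first.

```javascript
// Hypothetical helper: guess an encoding from the leading bytes of a file.
// Returns null when no byte order mark is present.
function detectBom(bytes) {
  const startsWith = (...signature) => signature.every((b, i) => bytes[i] === b);

  if (startsWith(0xef, 0xbb, 0xbf)) return "UTF-8";
  // Check the 4-byte marks before the 2-byte ones they overlap with.
  if (startsWith(0xff, 0xfe, 0x00, 0x00)) return "UTF-32LE";
  if (startsWith(0x00, 0x00, 0xfe, 0xff)) return "UTF-32BE";
  if (startsWith(0xff, 0xfe)) return "UTF-16LE";
  if (startsWith(0xfe, 0xff)) return "UTF-16BE";
  return null;
}

console.log(detectBom([0xef, 0xbb, 0xbf, 0x41])); // UTF-8
console.log(detectBom([0xff, 0xfe, 0x41, 0x00])); // UTF-16LE
console.log(detectBom([0x41, 0x42]));             // null
```

Many files carry no BOM at all, so a null result only means the encoding has to be determined some other way.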
// Use case: encodeURI encodes a whole URL
encodeURI("http://www.cnblogs.com/season-huang/some other thing");
// ==> "http://www.cnblogs.com/season-huang/some%20other%20thing"
// Spaces, Chinese characters, and other special characters are escaped

// Use case: encoding a parameter inside a URL
encodeURIComponent("http://www.baidu.com?callback=xxx")
// ==> "http%3A%2F%2Fwww.baidu.com%3Fcallback%3Dxxx"
// Encodes individual parts such as URL parameters or the hash

Base64 encoding

Base64 encoding turns binary data of any length into plain text that contains only the characters A-Z, a-z, 0-9, +, /, and =. It works by splitting every 3 bytes of binary data into four 6-bit groups, which gives 4 integers, and then using each integer as an index into a lookup table to get the encoded characters. For example, the e-mail protocol is a text protocol, so to attach a binary file to an e-mail you can encode it in Base64 and send it as text.

The idea of Base64 encoding is to re-encode data using 64 basic ASCII characters.

  • 1. Treat the data to be encoded as a byte array. Take 3 bytes at a time as a group, lay out their 24 bits in order, and split the 24 bits into 4 groups of 6 bits each;
  • 2. Add two zeros in front of the highest bit of each 6-bit group to make up a byte, so that a group of 3 bytes is re-encoded into 4 bytes;
  • 3. When the number of bytes to be encoded is not a multiple of 3, so that the last group has fewer than 3 bytes, pad the last group with 1 or 2 zero bytes, and append 1 or 2 = signs to the end of the encoded result.
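The three steps above can be sketched as a hand-written encoder. The names BASE64_TABLE and base64Encode are invented for this illustration, the input is assumed to be an array of byte values, and in practice the built-in btoa does the same job for binary strings.

```javascript
// The standard Base64 alphabet: index 0..63 maps to one character.
const BASE64_TABLE =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: Base64-encode an array of byte values (0..255).
function base64Encode(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const group = bytes.slice(i, i + 3);
    const padding = 3 - group.length; // 0, 1 or 2 missing bytes

    // Step 1: lay the (zero-padded) 3 bytes out as 24 bits.
    const bits = (group[0] << 16) | ((group[1] || 0) << 8) | (group[2] || 0);

    // Step 2: cut the 24 bits into four 6-bit indexes.
    const indexes = [(bits >> 18) & 63, (bits >> 12) & 63, (bits >> 6) & 63, bits & 63];
    const chars = indexes.map((n) => BASE64_TABLE[n]);

    // Step 3: characters that came only from padding bytes become "=".
    out += chars.slice(0, 4 - padding).join("") + "=".repeat(padding);
  }
  return out;
}

console.log(base64Encode([0x41, 0x42, 0x43])); // QUJD
console.log(base64Encode([0x41, 0x42]));       // QUI=
console.log(base64Encode([0x41]));             // QQ==
```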

Base64 encoding and decoding process

Since a 6-bit integer always lies in the range 0 to 63, it can be represented by 64 characters: the characters A-Z correspond to indexes 0 to 25, a-z to indexes 26 to 51, 0-9 to indexes 52 to 61, and the last two indexes, 62 and 63, are represented by + and / respectively.

For example, the three bytes of ABC are 41, 42, 43 in hexadecimal; grouped by 6 bits they give 16, 20, 9, and 3:

┌───────────────┬───────────────┬───────────────┐
│      41       │      42       │      43       │  // hexadecimal 0x
└───────────────┴───────────────┴───────────────┘
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│0│1│0│0│0│0│0│1│0│1│0│0│0│0│1│0│0│1│0│0│0│0│1│1│  // binary 0b
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
┌───────────┬───────────┬───────────┬───────────┐
│    16     │    20     │     9     │     3     │  // decimal
└───────────┴───────────┴───────────┴───────────┘

- First, take the ASCII value of each character in ABC: A: 65 => 0b01000001, B: 66 => 0b01000010, C: 67 => 0b01000011
- Join the three bytes into one binary string: 010000010100001001000011
- Split it into four 6-bit blocks and pad each with two leading zeros to form four bytes: 00010000, 00010100, 00001001, 00000011
- Convert these four bytes to decimal: 16, 20, 9, 3
- Finally, look them up in the table of 64 basic characters to get Q, U, J, and D. Each value is simply the index of a character in the table.

Decoding is the reverse process: convert each group of four bytes back into three bytes, then reassemble the byte array into data according to its original form.
  • The disadvantage of Base64 encoding is reduced transmission efficiency, because the encoded data is one third longer than the original.

  • Like URL encoding, Base64 is an encoding algorithm, not an encryption algorithm.

  • If you replace the Base64 table with one of 32, 48, or 58 characters, you get Base32, Base48, and Base58 encoding. The fewer characters in the table, the less efficient the encoding.
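The one-third overhead follows directly from the 3-bytes-to-4-characters rule. A minimal sketch, with base64Length as a made-up name, assuming standard padded Base64:

```javascript
// Hypothetical helper: length of the padded Base64 output for n input bytes.
// Every started group of 3 bytes produces 4 output characters.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

console.log(base64Length(3));   // 4
console.log(base64Length(300)); // 400
console.log(base64Length(1));   // 4 (two of them are "=" padding)
```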

String to base64 & base64 to string

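// String to Base64
function encode(str) {
  // btoa only accepts characters up to 0xFF, so percent-encode the string first
  var encoded = encodeURI(str);
  var base64 = btoa(encoded);
  return base64;
}

// Base64 to string
function decode(base64) {
  var decoded = atob(base64);
  var str = decodeURI(decoded);
  return str;
}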
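The encodeURI-based helpers above produce the Base64 of the percent-encoded text, not of the string's UTF-8 bytes. As an alternative sketch, the version below encodes the UTF-8 bytes directly. The function names are hypothetical, and it assumes an environment where TextEncoder, TextDecoder, btoa, and atob are available as globals (modern browsers and recent Node.js versions).

```javascript
// Hypothetical helper: string -> UTF-8 bytes -> Base64.
function utf8ToBase64(str) {
  const bytes = new TextEncoder().encode(str);
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte); // one char per byte, as btoa expects
  }
  return btoa(binary);
}

// Hypothetical helper: Base64 -> UTF-8 bytes -> string.
function base64ToUtf8(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64("\u4e2d"));              // 5Lit  ("\u4e2d" is 中)
console.log(base64ToUtf8("5Lit") === "\u4e2d");   // true
```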

Image to Base64 & Base64 to image

// Draw the image onto a canvas and export it as a Base64 data URL
function image2Base64(img) {
    var canvas = document.createElement("canvas");
    canvas.width = img.width;
    canvas.height = img.height;
    var ctx = canvas.getContext("2d");
    ctx.drawImage(img, 0, 0, img.width, img.height);
    var dataURL = canvas.toDataURL("image/png");
    return dataURL;
}


function getImgBase64(){
    var base64="";
    var img = new Image();
    img.src="img/test.jpg";
    img.onload = function(){
        base64 = image2Base64(img);
        alert(base64);
    }
}

URL encoding

Why do URIs need to be encoded?

URLs are encoded because some characters in a URL would otherwise be ambiguous. For compatibility reasons, many servers only recognize ASCII characters. So what happens when a URL contains non-ASCII characters such as Chinese or Japanese?

Generally speaking, a URL can only use English letters, Arabic numerals, and certain punctuation marks, and cannot use other characters and symbols. This is because the network standard RFC 1738 stipulates:

"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

The standard does not specify exactly how the encoding should be done, which leaves it up to the browser. The rule that browsers commonly use today is to replace every character except a-z, A-Z, 0-9, and a few symbols such as . - _ with a % escape:

  • If the character is A-Z, a-z, 0-9, or one of -, _, ., *, it stays unchanged;
  • Any other character is first converted to its UTF-8 encoding, and then each byte is written as %XX.

https://www.baidu.com/s?wd=%E4%B8%AD%E6%96%87
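The two rules above can be sketched by hand. The name percentEncode is made up for this example, the set of unchanged characters is taken from the rule as stated in this article, and real code should use encodeURIComponent, whose safe set is slightly different.

```javascript
// Hypothetical helper: percent-encode a string following the two rules above.
function percentEncode(str) {
  const unchanged = /[A-Za-z0-9\-_.*]/; // characters kept as-is by the stated rule
  const encoder = new TextEncoder();
  let out = "";
  for (const ch of str) { // iterate by code point
    if (unchanged.test(ch)) {
      out += ch;
      continue;
    }
    // Convert the character to UTF-8, then write each byte as %XX.
    for (const byte of encoder.encode(ch)) {
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

const keywordText = "\u4e2d\u6587"; // 中文
console.log(percentEncode(keywordText)); // %E4%B8%AD%E6%96%87
console.log(percentEncode(keywordText) === encodeURIComponent(keywordText)); // true
console.log(percentEncode("a b")); // a%20b
```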

How does JS encode URLs?

JavaScript provides three pairs of functions for encoding URLs into valid URLs:

Functions                                Safe characters                 Count  Usage
escape / unescape                        */@+-._0-9a-zA-Z                69     Deprecated since ES3; use encodeURI() and encodeURIComponent() instead.
encodeURI / decodeURI                    !#$&'()*+,/:;=?@-._~0-9a-zA-Z   82     Encodes a complete URI. It does not escape the ASCII punctuation that has special meaning in URIs (; / ? : @ & = + $ , #), because these characters separate parts such as the host and the path, so it is suited to encoding full URLs. Its counterpart for decoding is decodeURI.
encodeURIComponent / decodeURIComponent  !'()*-._~0-9a-zA-Z              71     Unlike encodeURI, which encodes the entire URL, encodeURIComponent encodes an individual part of the URL, so characters such as ; / ? : @ & = + $ , # are encoded as well.
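The difference matters most when a parameter value itself contains reserved characters. The sketch below uses an invented keyword and the search URL from earlier in this article; the variable names are only for illustration.

```javascript
const base = "https://www.baidu.com/s";
const keyword = "a&b=\u4e2d\u6587"; // "a&b=中文", a value that contains & and =

// encodeURI leaves & and = alone, so the value leaks into the query structure.
const wrong = base + "?wd=" + encodeURI(keyword);
// encodeURIComponent escapes them, so the value stays one parameter.
const right = base + "?wd=" + encodeURIComponent(keyword);

console.log(wrong); // https://www.baidu.com/s?wd=a&b=%E4%B8%AD%E6%96%87
console.log(right); // https://www.baidu.com/s?wd=a%26b%3D%E4%B8%AD%E6%96%87

console.log(new URL(wrong).searchParams.get("wd"));             // a
console.log(new URL(right).searchParams.get("wd") === keyword); // true
```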

HTML encoding

In order to properly display HTML pages, Web browsers must understand the character set used in the page.

<meta charset="UTF-8">

HTML characters & escape characters

Escape comparison table

HTML Character Entities & Escape Sequence

HTML defines escape strings for two reasons. The first is that symbols such as < and > are already used to mark up HTML tags, so they cannot be used directly as symbols in text. To use these symbols in an HTML document, an escape string is needed; when the parser encounters such a string, it interprets it as the real character. Escape strings are case-sensitive, so they must be typed exactly. The second reason is that some characters are not defined in the ASCII character set and therefore need to be represented with escape strings.

An escape sequence, also called a character entity, has three parts: the first is an ampersand (&); the second is the entity name, or # followed by the entity number; the third is a semicolon.

For example, to display the less-than sign (<), you can write &lt; or &#60;.

The advantage of an entity name is that it is easier to understand (lt, for instance, suggests "less than"); the disadvantage is that not all browsers support the newest entity names. Entity numbers, however, can be handled by any browser.

The most commonly used character entities

Result   Description         Entity name                    Entity number
(space)  Space               &nbsp;                         &#160;
<        Less-than sign      &lt;                           &#60;
>        Greater-than sign   &gt;                           &#62;
&        Ampersand           &amp;                          &#38;
"        Quotation mark      &quot;                         &#34;
'        Apostrophe          &apos; (not supported in IE)   &#39;
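The entity numbers in the table are simply the characters' code points in decimal. A small sketch, with toEntityNumber as a made-up helper name:

```javascript
// Hypothetical helper: build the numeric character entity for a character.
function toEntityNumber(ch) {
  return "&#" + ch.codePointAt(0) + ";";
}

for (const ch of ["<", ">", "&", '"', "'"]) {
  console.log(ch, toEntityNumber(ch));
}
// < &#60;
// > &#62;
// & &#38;
// " &#34;
// ' &#39;

console.log(toEntityNumber("\u00a0")); // &#160; (the character behind &nbsp;)
```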

HTML character escape and reverse escape

// HTML escape
function HTMLEncode(html) {
  let temp = document.createElement("div"); // div can also be replaced with pre
  (temp.textContent != null) ? (temp.textContent = html) : (temp.innerText = html);
  const output = temp.innerHTML;
  temp = null;
  return output;
}

const tagText = "<p><b> 123&456 </b></p>";
console.log(HTMLEncode(tagText)); // &lt;p&gt;&lt;b&gt; 123&amp;456 &lt;/b&gt;&lt;/p&gt;
// HTML unescape
function HTMLDecode(text) {
  if (text === null || text === undefined || text === '') {
    return '';
  }
  if (typeof text !== 'string') {
    return String(text);
  }
  let temp = document.createElement("div"); // div can also be replaced with pre
  temp.innerHTML = text;
  const output = temp.textContent || temp.innerText;
  temp = null;
  return output;
}

const escapedText = "&lt;p&gt;&lt;b&gt; 123&amp;456 &lt;/b&gt;&lt;/p&gt;";
console.log(HTMLDecode(escapedText)); // <p><b> 123&456 </b></p>

Usage scenarios

When a user types HTML tags into an input box, the content needs to be escaped to prevent XSS attacks; when it is displayed, it may need to be unescaped.

1. HTMLEncode converts < > & " ' into character entities. Usage scenario:

  • (1) The user enters <script>alert(2);</script> in the page (for example, in an input box), and the JS submits the content to the back end to be saved.
  • (2) At display time, the back end returns the string to the front end. After the JS receives it:

a. With HTMLEncode, the string becomes &lt;script&gt;alert(2);&lt;/script&gt;. The browser then parses it correctly, because it receives character entities and renders them as angle brackets and other literal characters.
b. Without HTMLEncode, the browser sees < as the start of an HTML tag and executes the string as a script, which is an XSS vulnerability.

2. HTMLDecode converts character entities back into < > & " '. Usage scenario: the back end returns already-escaped content to the page, such as &lt;script&gt;alert(2);&lt;/script&gt;. After the JS receives it:

a. With HTMLDecode on the front end, DOM operations can be performed directly and the tags are displayed on the page.
b. Without HTMLDecode on the front end, the content is not executed at this point.

How do I prevent XSS attacks

Elegant solution

Encode user input: escape special characters such as <, ", &, and > so that the browser displays them as plain text.

// How artTemplate escapes output to defend against XSS
var escapeMap = {
  "<": "&#60;",
  ">": "&#62;",
  '"': "&#34;",
  "'": "&#39;",
  "&": "&#38;"
};

var escapeFn = function (s) {
  return escapeMap[s];
};

var escapeHTML = function (content) {
  // The library uses its own toString helper here; String() stands in for it.
  return String(content)
    .replace(/&(?![\w#]+;)|[<>"']/g, escapeFn);
};
const matchList = {
  '&lt;': '<',
  '&gt;': '>',
  '&amp;': '&',
  '&#34;': '"',
  '&quot;': '"',
  '&#39;': "'"
}

// Character filter
const HtmlFilter = (text) => {
  // Take the keys of the match list and join them into one string
  let regStr = '(' + Object.keys(matchList).toString() + ')'
  // Turn the comma-separated keys into a regular expression alternation
  regStr = regStr.replace(/,/g, ')|(')
  // Assumption: the flags were lost in the source; a global match is used here
  const regExp = new RegExp(regStr, 'g')
  // replace(regex, current match => value stored under the matched key)
  return text.replace(regExp, match => matchList[match])
}

export default HtmlFilter
Brute-force solution

Take advantage of the fact that innerText/textContent can only set plain text.

// The source shows an anonymous function; the name escapeByTextNode is added here so it can be called
function escapeByTextNode(value) {
  if (typeof value !== 'string') {
    return value;
  }
  var str = value || '',
    temp = document.createElement("div"),
    obj;
  (temp.textContent != undefined) ? (temp.textContent = str) : (temp.innerText = str);
  obj = temp.innerHTML;
  temp = null;
  return obj;
}

Plug-in

  • js-xss, a front-end library for preventing XSS attacks

Summary

  • Base64 encoding and URL encoding are both encoding algorithms, not encryption algorithms.
  • The purpose of Base64 encoding is to turn arbitrary binary data into text, at the cost of one third more data after encoding.
  • The purpose of URL encoding is to turn arbitrary text into %-prefixed text that browsers and servers can process easily.
  • HTML escape characters exist because some characters are not defined in the ASCII character set and need to be represented by escape strings, and because special HTML characters such as < and > cannot be used directly as symbols in text.