Understanding binary data

Binary is a number system widely used in computing. Binary data are numbers represented with the digits 0 and 1. Its base is 2, the carry rule is "carry one for every two", and the borrow rule is "borrow one as two". The system was described by the German mathematician and philosopher Leibniz.

— Baidu Encyclopedia

Binary data is stored as zeros and ones, as described above. Decimal integers are usually converted to binary with the "divide by 2, take remainders in reverse order" method: divide the decimal integer by 2 to get a quotient and a remainder; divide that quotient by 2 again to get another quotient and remainder, and repeat until the quotient is 0. The remainders are then arranged in reverse order: the remainder obtained first is the least significant bit of the binary number, and the remainder obtained last is the most significant bit. For example, the number 10 converts to binary 1010, so the number 10 is stored in the computer as 1010.
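
As a quick sketch of this method (the helper function toBinary below is only an illustration, not part of any library):

// Illustrative helper for "divide by 2, take remainders in reverse order"
function toBinary(n) {
  if (n === 0) return '0'
  const bits = []
  while (n > 0) {
    bits.push(n % 2)          // the remainder becomes the next bit
    n = Math.floor(n / 2)     // continue with the quotient
  }
  return bits.reverse().join('')  // the first remainder is the least significant bit
}

console.log(toBinary(10))       // '1010'
console.log((10).toString(2))   // '1010', JavaScript's built-in conversion agrees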

For example, the ASCII code for the letter a is 97, which is 0110 0001 in binary. In JavaScript, we can use the charCodeAt method to get a character's code:
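
For instance, a quick check in Node.js or the browser console:

console.log('a'.charCodeAt(0))   // 97
console.log((97).toString(2))    // '1100001', i.e. 0110 0001 when padded to 8 bits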

In addition to ASCII, there are other encodings that map different characters. For example, the Chinese character 汉 used in this article has a UTF-16 code unit that can also be obtained through JavaScript's charCodeAt method.
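
For example, with the character 汉 ('han'), which is discussed again later in this article:

console.log('汉'.charCodeAt(0))               // 27721
console.log('汉'.charCodeAt(0).toString(16))  // '6c49', its UTF-16 / Unicode code point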

Node handles binary data

In its early days, JavaScript was mainly used for processing form data, so it is naturally good at handling strings, as can be seen from the many convenient string-manipulation methods on String.prototype.

However, on the server side it is not enough to manipulate only strings. In particular, I/O on the network and on files needs to support binary data streams, and Node.js's Buffer was designed for exactly that. Fortunately, ES6 introduced TypedArray, gradually adding binary handling to the language itself, and TypedArray can now be used directly in Node.js. Even so, in Node.js, Buffer is better suited to binary data processing and performs better; a Buffer can in fact be viewed directly as a Uint8Array, one of the TypedArray types. In addition to Buffer, Node.js also provides the Stream interface, which is mainly used for I/O on large files and processes the data in chunks rather than loading it all at once.

Getting to know Buffer

As the name implies, a Buffer instance in Node.js is a buffer area used specifically to store binary data. A Buffer can be thought of as a region of memory that has been set aside, and the size of the Buffer is the size of that region. Let's look at the basic use of Buffer.

Introduction to the API

In earlier versions of Node.js, buffers were created with the Buffer constructor, which allocated different kinds of buffers depending on the type of the argument.

new Buffer(size)

Creates a Buffer of the given size (a number), measured in bytes.

new Buffer(5)
// <Buffer 00 00 00 00 00>

new Buffer(array)

Allocates a new Buffer from an array of octets (8-bit bytes).

const buf = new Buffer([0x74, 0x65, 0x73, 0x74])
// <Buffer 74 65 73 74>
// The hexadecimal numbers correspond to t, e, s, t

// Convert the Buffer instance to a string to get the following result
buf.toString() // 'test'

new Buffer(buffer)

Copies the data of the passed buffer into a new Buffer instance.

const buf1 = new Buffer('test')
const buf2 = new Buffer(buf1)

new Buffer(string[, encoding])

Creates a Buffer containing the given string, using the specified encoding (UTF-8 by default).

const buf = new Buffer('test')
// <Buffer 74 65 73 74>
// Equivalent to new Buffer([0x74, 0x65, 0x73, 0x74])

buf.toString() // 'test'

A safer Buffer

Because the Buffer constructor produces different results depending on the type of its first argument, it can easily lead to security issues if the developer does not validate the argument. For example, if I want to create a Buffer from the string "20" but mistakenly pass in the number 20, I get a Buffer instance of length 20 instead.
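
A quick sketch of that pitfall, using the now-deprecated constructor (modern Node.js prints a deprecation warning for it):

// Deprecated constructor, shown only to illustrate the pitfall
const a = new Buffer('20')  // <Buffer 32 30>, two bytes for the characters '2' and '0'
const b = new Buffer(20)    // a 20-byte Buffer whose contents are whatever was left in memory
console.log(a.length)       // 2
console.log(b.length)       // 20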

Before Node.js 8, for the sake of performance, the memory allocated for a Buffer was not cleared of existing data, so returning such a Buffer directly could leak sensitive information. As a result, the Buffer API was overhauled around Node.js 8 and instantiating a Buffer with the constructor is no longer recommended. Instead, use Buffer.from(), Buffer.alloc(), and Buffer.allocUnsafe() in place of new Buffer().

Buffer.from()

This method is used to replace new Buffer(string), new Buffer(array), and new Buffer(buffer).
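
For example, the earlier constructor calls roughly map onto Buffer.from() like this:

const fromString = Buffer.from('test')                    // replaces new Buffer(string)
const fromArray = Buffer.from([0x74, 0x65, 0x73, 0x74])   // replaces new Buffer(array)
const fromBuffer = Buffer.from(fromString)                // replaces new Buffer(buffer), copies the data

console.log(fromString)             // <Buffer 74 65 73 74>
console.log(fromArray.toString())   // 'test'
console.log(fromBuffer.toString())  // 'test'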

Buffer.alloc(size[, fill[, encoding]])

This method is used to replace new Buffer(size). The Buffer created by this method is filled with zeros by default, which overwrites whatever data was previously in that memory. This makes it safer than new Buffer(size), at the cost of slightly lower performance.

Also, if the size argument is not a number, a TypeError is thrown.
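
For example:

console.log(Buffer.alloc(5))        // <Buffer 00 00 00 00 00>, zero-filled
console.log(Buffer.alloc(5, 'a'))   // <Buffer 61 61 61 61 61>, filled with 'a'
// Buffer.alloc('5') would throw a TypeError because size must be a number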

Buffer.allocUnsafe(size)

This method behaves like the old new Buffer(size). Although it is not safe, it has a clear performance advantage over Buffer.alloc().
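
A small sketch; the initial contents of a Buffer from allocUnsafe depend on whatever was previously in that memory:

const unsafe = Buffer.allocUnsafe(10) // fast, but may contain old (possibly sensitive) data
console.log(unsafe)                   // contents are unpredictable
unsafe.fill(0)                        // zero it manually if the old data must not leak
console.log(unsafe)                   // <Buffer 00 00 00 00 00 00 00 00 00 00>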

Buffer encoding

Converting between binary data and characters requires an encoding: one must be specified both when turning a string into a Buffer and when turning a Buffer back into a string.

Node.js currently supports the following encoding modes:

  • hex: Encodes each byte as two hexadecimal characters.
  • ascii: Only for 7-bit ASCII data. This encoding is fast and strips the high bit if it is set.
  • utf8: Multi-byte Unicode characters. Many web pages and other document formats use UTF-8.
  • utf16le: 2 or 4 bytes, little-endian encoded Unicode characters.
  • ucs2: An alias of utf16le.
  • base64: Base64 encoding.
  • latin1: A way to encode a Buffer into a single-byte (one byte per character) string.
  • binary: An alias of latin1.

JavaScript's charCodeAt returns UTF-16 code units, because strings in JavaScript are stored as UTF-16. The default encoding for a Buffer, however, is UTF-8.
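
A minimal comparison, using the Chinese character 汉 ('han') that comes up below:

const utf8Buf = Buffer.from('汉')             // the default encoding is utf8
const utf16Buf = Buffer.from('汉', 'utf16le')

console.log(utf8Buf)                          // <Buffer e6 b1 89>, 3 bytes in UTF-8
console.log(utf16Buf)                         // <Buffer 49 6c>, 2 bytes in UTF-16LE
console.log('汉'.charCodeAt(0).toString(16))  // '6c49', the UTF-16 code unit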

You can see that a Chinese character takes up three bytes in UTF-8, but only two bytes in UTF-16. The main reason is that UTF-8 is a variable-length encoding: the most common characters are represented with a single byte, which saves space, while characters outside that range take two or three bytes, and most Chinese characters need three bytes in UTF-8. UTF-16 uses two bytes for most characters and four bytes for characters that do not fit in two. For the two-byte range, the UTF-16 code unit is identical to the Unicode code point, and the code point of most Chinese characters can be looked up in a Unicode table of Chinese characters. The character 汉 ('han') mentioned above has the Unicode code point 6C49.

The Unicode mentioned here assigns a unified and unique code point to every character in every language, and UTF-8 and UTF-16 are two of its encoding forms. Further details about character encoding are beyond the scope of this article, but you can look them up if you are interested.

The cause of garbled characters

We often run into garbled characters; they are caused by using different encodings when converting between strings and Buffers.

Let's create a new text file, save it with UTF-16 encoding, and then read that file from Node.js.

const fs = require('fs')
const buffer = fs.readFileSync('./1.txt')
console.log(buffer.toString())

Since toString uses UTF-8 by default, garbled characters are printed. If we change the toString encoding to utf16le, the text is printed correctly.

const fs = require('fs')
const buffer = fs.readFileSync('./1.txt')
console.log(buffer.toString('utf16le'))

Getting to know Stream

As mentioned earlier, Node.js can use a Buffer to store binary data, but if the data volume is very large, Buffers can consume a considerable amount of memory; in that case you need Node.js Streams. To understand streams, you first need to understand the concept of a pipe.

In Unix-like operating systems (and some others that borrow this design, such as Windows), a pipeline is a chain of processes connected by their standard input and output, so that the output of each process feeds directly into the input of the next. The concept was invented by Douglas McIlroy for the Unix command line and is named for its resemblance to a physical pipe.

— from Wikipedia

We often pipe the results of one command to another at the Linux command line, for example, to search for files.

ls | grep code

Here, ls lists the files in the current directory, and grep filters that listing for entries containing the keyword code.

The concept of pipes is also used in gulp, the front-end build tool. Pipes simplified its workflow and helped it quickly overtake grunt in popularity.

// Compile SCSS using gulp
const gulp = require('gulp')
const sass = require('gulp-sass')
const csso = require('gulp-csso')

gulp.task('sass', function () {
  return gulp.src('./**/*.scss')
    .pipe(sass())             // compile SCSS to CSS
    .pipe(csso())             // compress the CSS
    .pipe(gulp.dest('./css'))
})

Having said so much about pipes, how exactly do pipes and streams relate? A stream can be thought of as flowing water: where the water goes is determined by the pipe, and without a pipe the water cannot form a flow, so a stream must be attached to a pipe. All I/O operations in Node.js can be done with streams, because the nature of I/O is moving data from one place to another. A network request, for example, streams data from the server to the client.

const fs = require('fs')
const http = require('http')

const server = http.createServer((request, response) => {
    // Create a data stream
    const stream = fs.createReadStream('./data.json')
    // Pipe the data flow to the response flow
    stream.pipe(response)
})

server.listen(8100)
// data.json
{ "name": "data" }

Here data.json is read and written to the response stream chunk by chunk, rather than the whole of data.json being read into memory and then written to the response at once, as the Buffer approach would do.

A Stream still uses Buffers internally. If we compare a piece of binary data to a bucket of water, then transferring a file with a Buffer is like pouring all the water from one bucket into another at once, while using a Stream is like pumping the water from the bucket through a pipe.

Stream vs. Buffer memory consumption

The difference may not be obvious from the description alone, so let's copy a 2 GB file with both a Stream and a Buffer and see how much memory the Node.js process consumes in each case.

Copying a file with a Stream

// Stream copies the file
const fs = require('fs');
const file = './file.mp4';
fs.createReadStream(file)
  .pipe(fs.createWriteStream('./file.copy.mp4'))
  .on('finish', () => {
    console.log('file successfully copied');
  })

Copying a file with a Buffer

// Buffer copies the file
const fs = require('fs');
const file = './file.mp4';
// fs.readFile outputs buffers directly
fs.readFile(file, (err, buffer) => {
    fs.writeFile('./file.copy.mp4', buffer, (err) => {
        console.log('file successfully copied');
    });
});

In my test, copying through a Stream took up only about 0.6% of my computer's memory, while using a Buffer took up 15.3%.

Introduction to the API

There are four fundamental types of Stream in Node.js.

  • Readable, a stream from which data can be read;
  • Writable, a stream to which data can be written;
  • Duplex, a stream that is both readable and writable;
  • Transform, a Duplex stream that can modify or transform the data as it is written and read.

All streams can be consumed through pipes (.pipe()), similar to the pipe (|) in Linux; alternatively, you can consume a stream's data by listening to its events. Both file reads and writes and HTTP requests and responses create streams internally: reading a file creates a readable stream, and writing a file creates a writable stream.

Readable streams

Although it is called a readable stream, data still has to be written into it; that writing is usually done internally by the data source, and the consumer of the stream only needs to read from it.

Readable streams generally fall into two modes:

  • Flowing mode: data is read from the underlying source automatically and delivered to the consumer, usually via event listeners.
  • Paused mode: data in the stream is not consumed automatically; to read it, you must explicitly call stream.read().

Readable streams are created in paused mode by default, and automatically switch to flowing mode once .pipe() is called or a 'data' event listener is attached.

const { Readable } = require('stream')
// Create a readable stream (a no-op read() implementation is required)
const readable = new Readable({ read() {} })
// Listening for the 'data' event switches the stream to flowing mode
readable.on('data', chunk => {
  console.log('chunk:', chunk.toString()) // output the chunk
})
// Write 5 letters
for (let i = 97; i < 102; i++) {
  const str = String.fromCharCode(i);
  readable.push(str)
}
// Push 'null' to indicate that the stream has ended
readable.push(null)

const { Readable } = require('stream')
// Create a readable stream (a no-op read() implementation is required)
const readable = new Readable({ read() {} })
// Write 5 letters
for (let i = 97; i < 102; i++) {
  const str = String.fromCharCode(i);
  readable.push(str)
}
// Push 'null' to indicate that the stream has ended
readable.push('\n')
readable.push(null)
// Output the stream's data to the console through a pipe
readable.pipe(process.stdout)
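
Both examples above use flowing mode. As a minimal sketch of paused mode, you can listen for the 'readable' event and pull data explicitly with stream.read():

const { Readable } = require('stream')

// Create a readable stream (a no-op read() implementation is required)
const readable = new Readable({ read() {} })
readable.push('abc')
readable.push(null)

// Paused mode: nothing is delivered until we explicitly call read()
readable.on('readable', () => {
  let chunk
  while ((chunk = readable.read()) !== null) {
    console.log('chunk:', chunk.toString())
  }
})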

The code above creates readable streams manually and then pushes data into them with the push method. As mentioned earlier, Node.js usually does this writing internally. Here is an example of a readable stream created by fs when reading a file:

const fs = require('fs')
// Create a readable stream of data.json files
const read = fs.createReadStream('./data.json')
// Listening for the 'data' event switches the stream to flowing mode
read.on('data', json => {
  console.log('json:', json.toString())
})

Writable streams

Compared with the readable stream, the writable stream really is write-only for its consumer; like the mythical pixiu, it only takes things in and never gives anything out.

When you create a writable stream yourself, you have to implement a _write() method. The underscore prefix indicates that it is an internal method: consumers of the stream never call it directly, and Node.js's built-in streams define it internally. For example, a file writable stream writes the Buffer passed to it into the target file inside this method.

When writing is finished, you usually call the writable stream's .end() method to end it, after which the 'finish' event is emitted.

const { Writable } = require('stream')
// Create a writable stream
const writable = new Writable()
// Implement _write to print the written data to the console
writable._write = function (chunk, encoding, callback) {
  console.log(chunk.toString())
  callback() // signal that this chunk has been handled
}
// Write data
writable.write('abc')
// Finish writing
writable.end()

The _write method can also be provided by passing a write property in the options object when the Writable is instantiated.

const { Writable } = require('stream')
// Create a writable stream
const writable = new Writable({
  // Provide the _write implementation via the write option
  write(chunk, encoding, callback) {
    console.log(chunk.toString())
    callback()
  }
})
// Write data
writable.write('abc')
// Finish writing
writable.end()

Let's take a look at a writable stream created internally by fs in Node.js.

const fs = require('fs')
// Create a writable stream
const writable = fs.createWriteStream('./data.json')

// Write data, consistent with the writable stream you created manually
writable.write(`{ "name": "data" }`)
// Finish writing
writable.end()

In Node.js you call the .end() method to end an HTTP response because the response object is essentially a writable stream, as the short sketch below illustrates.
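
A minimal sketch (the port and payload here are made up for illustration):

const http = require('http')

const server = http.createServer((request, response) => {
  // response is a writable stream: write() sends chunks, end() finishes the stream
  response.write('{ "name": ')
  response.write('"data" }')
  response.end() // 'finish' is emitted once the response has been fully handed off
})

server.listen(8200)

With that in mind, it is easier to look back at the code that copied the file through a Stream.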

const fs = require('fs');
const file = './file.mp4';
fs.createReadStream(file)
  .pipe(fs.createWriteStream('./file.copy.mp4'))
  .on('finish', () => {
    console.log('file successfully copied');
  })

Duplex streams

A Duplex stream implements both the Readable and Writable interfaces; for the specifics you can refer to the readable and writable streams above, so this article won't spend more space on it.

Chaining pipes

You can transfer data from one bucket to another through a pipe (.pipe()), but when there are multiple buckets you need to call .pipe() several times. For example, suppose we have a file that needs to be gzip-compressed and then written back out.

const fs = require('fs')
const zlib = require('zlib')

const gzip = zlib.createGzip() // Gzip is a duplex stream, readable and writable
const input = fs.createReadStream('./data.json')
const output = fs.createWriteStream('./data.json.gz')

input.pipe(gzip) // File compression
gzip.pipe(output) // Output after compression

For this situation, Node.js provides the pipeline() API, which performs multiple pipe operations in a single call and also supports error handling.

const { pipeline } = require('stream')
const fs = require('fs')
const zlib = require('zlib')

const gzip = zlib.createGzip()
const input = fs.createReadStream('./data.json')
const output = fs.createWriteStream('./data.json.gz')

pipeline(
  input,   // input
  gzip,    // compression
  output,  // output
  // The last argument is a callback used to catch errors
  (err) => {
    if (err) {
      console.error('Compression failed', err)
    } else {
      console.log('Compression succeeded')
    }
  }
)

References

  • Character coding note
  • Buffer | Node.js API
  • stream | Node.js API
  • stream-handbook