
The content-disposition library generates and parses the Content-Disposition field of HTTP headers in Node.js. For example, when we implement the server-side logic for an HTTP file download and want the browser to automatically download the file when it accesses the URL, rather than rendering it inline, we need to set the Content-Disposition field of the response header.

The library is widely used by mainstream Node.js libraries and is downloaded over 20 million times a week. Let's start by looking at the Content-Disposition field in the HTTP protocol.

The Content-Disposition field in the HTTP protocol

In a regular HTTP response, the Content-Disposition response header indicates whether the content of the response should be displayed inline (that is, as a web page or part of a page) or downloaded and saved locally as an attachment.

In HTTP responses, the first parameter is either inline (the default, indicating that the body of the response is displayed as part of the page or as the entire page) or attachment (indicating that the body should be downloaded locally; most browsers present a "Save As" dialog, pre-filled with the value of the filename parameter, if present).

Content-Disposition: inline
Content-Disposition: attachment
Content-Disposition: attachment; filename="filename.txt"

Documentation for the Content-Disposition field in the HTTP protocol

Here is a demonstration of creating an HTTP service that uses the Content-Disposition field to make the browser automatically download a PDF resource.

const http = require('http');
const fs = require('fs');
const mime = require('mime');
const contentDisposition = require('content-disposition');


const server = http.createServer((req, res) => {
  // Path of a normally named PDF resource
  const pdfPath = __dirname + '/demo.pdf';
  // Path of a PDF resource whose name contains a character outside ISO-8859-1
  // const pdfPath = __dirname + '/demoŸ.pdf';

  // Content-Type of the PDF resource: application/pdf
  const pdfType = mime.getType('pdf');
  const stream = fs.createReadStream(pdfPath);

  res.setHeader('Content-Type', pdfType);
  // Set the download header
  // A download dialog pops up automatically when the browser visits http://localhost:3000
  // If this response header is not set, the browser shows a preview of the PDF instead
  res.setHeader('Content-Disposition', contentDisposition(pdfPath));
  stream.pipe(res);
});

server.listen(3000, () => {
  console.log('[pdf download server] running at port 3000.');
});

When you enter http://localhost:3000 in the browser address bar, the download dialog pops up automatically, as shown below:

If the Content-Disposition header were not set, the browser would simply show a preview of the PDF, as shown below:

From the example above we can see that a download service that wants the receiver to download the response as an attachment simply needs to set the Content-Disposition header. So what do we still need the content-disposition library for? What is its purpose?

The reason is that edge cases remain. For example, if the download filename contains characters outside the ISO-8859-1 character set (such as letters from some Western European languages), should it automatically be transcoded to ISO-8859-1? And should Content-Disposition support the RFC 5987 standard for carrying such filenames?

// Content-Disposition for a normally named demo.pdf
'attachment; filename="demo.pdf"'

// Content-Disposition for demoŸ.pdf, whose name contains a character outside ISO-8859-1
'attachment; filename="demo?.pdf"; filename*=UTF-8\'\'demo%C5%B8.pdf'
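The library also exposes a fallback option that controls this ISO-8859-1 fallback value. Below is a small, hedged sketch of how it behaves (the fallback name demo-backup.pdf is just a made-up example; the expected outputs follow from the generation logic analysed later in this article):

const contentDisposition = require('content-disposition');

// Default behaviour: an ISO-8859-1 fallback is generated automatically,
// characters outside the character set are replaced with '?'
contentDisposition('demoŸ.pdf');
// => attachment; filename="demo?.pdf"; filename*=UTF-8''demo%C5%B8.pdf

// Passing a fallback string of our own (it must itself be ISO-8859-1)
contentDisposition('demoŸ.pdf', { fallback: 'demo-backup.pdf' });
// => attachment; filename="demo-backup.pdf"; filename*=UTF-8''demo%C5%B8.pdf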

So before going into the source code of the content-disposition library, let's briefly talk about the different encoding concepts and how they differ.

A brief look at common character encodings in computers and their differences

As programmers, we have all more or less come across character encodings such as Unicode, GBK, ASCII, UTF-8, UTF-16 and ISO-8859-1. A character set encoding is essentially a one-to-one mapping between characters and the numbers (stored in binary) the computer uses to represent them.

Let's go through some of these encoding concepts and their differences. This part is somewhat dry, but it is the basis for understanding the source code that follows:

  • ASCII code

When computers were first invented in the United States, only 26 letters, digits and some common punctuation characters needed to be encoded, and so ASCII was born. Standard ASCII uses a 7-bit binary number (the highest bit is always 0) and can represent 128 characters. For example, converting between characters and ASCII codes in JS:

// Character to decimal ASCII code
var str = 'A';
str.charCodeAt(); // 65

// Decimal ASCII code to character
String.fromCharCode(65) // 'A'

An ASCII reference table can be found at www.habaijian.com/

  • ISO-8859-1 encoding

As computers spread around the world, different languages placed new requirements on character encoding. The original 128 ASCII characters were no longer enough, so the highest bit of ASCII was used to extend it to 256 character codes; this is ISO-8859-1.

ISO-8859-1 is a single-byte encoding, backward compatible with ASCII, with a range of 0x00-0xFF: 0x00-0x7F is exactly the same as ASCII, 0x80-0x9F holds control characters, and 0xA0-0xFF holds printable characters. This character set supports a number of languages used in Europe.

In the ISO-8859-1 standard, 0x80-0x9F holds control characters. ISO-8859-15 removes some rarely used symbols and instead adds the letters œ, Œ, Ÿ, š, Š, ž, Ž, the euro sign (€), single and double quotation marks, the italic f (ƒ), the ellipsis (…), the trademark sign (™), the per-mille sign (‰) and other common symbols.

(The above is from Baidu Baike.)
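A quick way to get a feel for these ranges in Node.js (a small sketch, not part of the library) is the built-in 'latin1' Buffer encoding, which is Node's name for ISO-8859-1:

// 'é' (0xE9) lies inside ISO-8859-1, so it survives a round trip
Buffer.from('é', 'latin1');                    // <Buffer e9>
Buffer.from('é', 'latin1').toString('latin1'); // 'é'

// '你' (U+4F60) lies outside ISO-8859-1; only the low byte is kept, so information is lost
Buffer.from('你', 'latin1');                   // <Buffer 60>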

  • GBK code

When computers arrived in China, a new problem appeared: there are a huge number of Chinese characters, more than 6,000 in common use, and a single ASCII/ISO-8859-1 byte cannot encode them. So 2 bytes are used to represent Chinese characters, while values below 127 keep their original ASCII meaning; this gave rise to GB2312, which supports more than 7,000 Chinese characters.

However, there are far more Chinese characters than that, and GB2312 was still not enough, so the scheme was extended further: as long as the first byte is greater than 127, the sequence is treated as the encoding of a Chinese character. This extended encoding is called GBK.

  • Unicode

While China came up with GBK, other countries were extending their own encodings in the same way, with the result that the same code could mean different things in different encodings. The ISO (International Organization for Standardization) therefore developed a unified global character set, called UCS (Universal Character Set), commonly known as Unicode. Unicode started as a 16-bit character set and now covers more than a million code points.

Unicode itself brings two problems. First, a computer has no way of knowing whether a run of bytes represents one character or several. Second, if every character were stored directly in three or four bytes, English text, which only needs one byte per character, would waste an enormous amount of storage.

  • UTF-8 encoding

UTF-8 is the most widely used Unicode implementation. Its main feature is variable-length encoding: it uses 1 to 4 bytes to represent a symbol, varying the length according to the symbol. Interpreting UTF-8 is simple: if the first bit of a byte is 0, the byte is a single character on its own; if the first bit is 1, the number of consecutive leading 1s indicates how many bytes the current character occupies.
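A small sketch of this in Node.js (not part of the library), inspecting the UTF-8 bytes of a few characters:

Buffer.from('A', 'utf8');  // <Buffer 41>        1 byte, leading bit 0
Buffer.from('é', 'utf8');  // <Buffer c3 a9>     2 bytes, first byte starts with 110
Buffer.from('你', 'utf8'); // <Buffer e4 bd a0>  3 bytes, first byte starts with 1110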

  • URL encoding and decoding

In development we often use encodeURIComponent / decodeURIComponent to encode and decode URLs, for example:

// Encode the URL
// 'http%3A%2F%2Flocalhost%3A3000%2F%3Fquery%3D%E4%BD%A0%E5%A5%BD'
encodeURIComponent('http://localhost:3000/?query=你好')

As you can see, ':' is encoded as %3A, '?' as %3F, and so on. Where do the values after the % come from? They are the hexadecimal UTF-8 bytes of the character. Decoding is simply the reverse: strip the % signs and decode the bytes back into characters as UTF-8.
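For example, a quick sketch with the string from above shows that the percent-encoded values are exactly the UTF-8 bytes of the characters:

Buffer.from('你好', 'utf8');               // <Buffer e4 bd a0 e5 a5 bd>
decodeURIComponent('%E4%BD%A0%E5%A5%BD'); // '你好'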

Now, with these encoding specifications in mind, let me break down the principles behind version 0.5.4 of the content-disposition library.

Content-Disposition generation principle

The content-disposition library is a single file that exports two methods:

'use strict'

/**
 * Module exports.
 * @public
 */

module.exports = contentDisposition
module.exports.parse = parse

function contentDisposition (filename, options) {}

function parse (string) {}

Let’s look at the implementation of the contentDisposition function, which ultimately generates the content-disposition value we need.

/**
 * Create an attachment Content-Disposition header
 *
 * @param {string} [filename]
 * @param {object} [options]
 * @param {string} [options.type=attachment]
 * @param {string|boolean} [options.fallback=true]
 * @return {string}
 * @public
 */

function contentDisposition (filename, options) {
  var opts = options || {}

  // Define type value, default value attachment
  // Tell the receiver how to display the response data,
  // attachment Indicates to save locally as an attachment download
  var type = opts.type || 'attachment'

  // Build the parameter object for the header-related fields from filename and fallback
  var params = createparams(filename, opts.fallback)

  // Format the object arguments into the string format of the header
  return format(new ContentDisposition(type, params))
}

The main logic is: define the default parameters, build the parameter object for the header fields from filename and fallback, and finally format the type and parameters into the actual header string value and return it.
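As a small usage sketch of the options seen above (the type option is part of the library's API; report.pdf is just an example filename, and the outputs are what the logic described in this section should produce):

const contentDisposition = require('content-disposition');

// Default type is 'attachment'
contentDisposition('report.pdf');
// => attachment; filename="report.pdf"

// An explicit type of 'inline' asks the browser to display the body instead of downloading it
contentDisposition('report.pdf', { type: 'inline' });
// => inline; filename="report.pdf"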

Now look at the implementation of createparams:

/**
 * RegExp that matches characters outside the Latin1 (ISO-8859-1) character set
 * @private
 */
var NON_LATIN1_REGEXP = /[^\x20-\x7e\xa0-\xff]/g

// RegExp for the RFC 2616 TEXT grammar character range
var TEXT_REGEXP = /^[\x20-\x7e\x80-\xff]+$/

// RegExp that matches a percent-encoded byte
var HEX_ESCAPE_REGEXP = /%[0-9A-Fa-f]{2}/

/**
 * Create the header parameter object from the filename and fallback arguments
 *
 * @param {string} [filename]
 * @param {string|boolean} [fallback=true]
 *   If the filename contains characters outside the ISO-8859-1 character set,
 *   whether to automatically generate an ISO-8859-1 fallback for it
 * @return {object}
 * @private
 */

function createparams (filename, fallback) {
  if (filename === undefined) {
    return
  }

  var params = {}

  if (typeof filename !== 'string') {
    throw new TypeError('filename must be a string')
  }

  // fallback defaults to true
  if (fallback === undefined) {
    fallback = true
  }

  if (typeof fallback !== 'string' && typeof fallback !== 'boolean') {
    throw new TypeError('fallback must be a string or boolean')
  }

  // If fallback is a string, it must itself conform to the ISO-8859-1 character set;
  // it is used instead of filename when filename does not conform to ISO-8859-1
  if (typeof fallback === 'string' && NON_LATIN1_REGEXP.test(fallback)) {
    throw new TypeError('fallback must be ISO-8859-1 string')
  }

  // The file base name, e.g. demo.txt
  var name = basename(filename)

  // determine if name is suitable for quoted string
  // i.e. whether the text conforms to the RFC 2616 character set rules
  var isQuotedString = TEXT_REGEXP.test(name)

  // generate fallback name
  // Get the fallback value for name:
  // - if fallback is true, transcode name to ISO-8859-1
  // - if fallback is a string, use it directly instead of transcoding automatically
  var fallbackName = typeof fallback !== 'string'
    ? fallback && getlatin1(name)
    : basename(fallback)

  // hasFallback is true only when a fallback string exists and differs from name:
  // - fallback === false   -> no fallback at all
  // - fallback === true    -> only when transcoding actually changed name
  // - fallback is a string -> only when it differs from the original name
  var hasFallback = typeof fallbackName === 'string' && fallbackName !== name

  // set extended filename parameter
  // If a fallback is needed, or name contains text outside the RFC 2616 character set,
  // or name contains a percent escape, emit the filename* field
  if (hasFallback || !isQuotedString || HEX_ESCAPE_REGEXP.test(name)) {
    params['filename*'] = name
  }

  // set filename parameter
  // If name conforms to the RFC 2616 character set rules (or a fallback exists),
  // emit the plain filename field, choosing fallbackName or name via hasFallback
  if (isQuotedString || hasFallback) {
    params.filename = hasFallback
      ? fallbackName
      : name
  }

  return params
}

The main logic here is:

  • Type-check fallback and the other parameters
  • Get the base name of the file to download
  • Decide, based on the filename value and fallback, whether to generate an ISO-8859-1 fallback for the file name
  • Decide whether the name conforms to the RFC 2616 character rules, i.e. which of filename / filename* needs to be emitted
  • Finally return the parameter object
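As a hedged sketch of how this parameter object shows up in the final header (the output below is what the logic above should produce for a name containing Ÿ, which lies outside ISO-8859-1):

const contentDisposition = require('content-disposition');

// fallback: false suppresses the ISO-8859-1 fallback entirely:
// no plain filename parameter is emitted, only the RFC 5987 filename* field
contentDisposition('demoŸ.pdf', { fallback: false });
// => attachment; filename*=UTF-8''demo%C5%B8.pdf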

Let’s look at how getlatin1 transcodes Unicode to ISO-8859-1:

/**
 * RegExp that matches characters outside the Latin1 (ISO-8859-1) character set
 * @private
 */
var NON_LATIN1_REGEXP = /[^\x20-\x7e\xa0-\xff]/g

/**
 * Get the ISO-8859-1 version of a string
 *
 * @param {string} val
 * @return {string}
 * @private
 */

function getlatin1 (val) {
  // simple Unicode -> ISO-8859-1 transformation
  // replace all characters outside ISO-8859-1 with '?'
  return String(val).replace(NON_LATIN1_REGEXP, '?')
}

A few extra notes on ISO-8859-1 here:

  • ISO-8859-1 is a single-byte encoding, backward compatible with ASCII, with the range 0x00-0xFF
    • 0x00-0x7F is exactly the same as ASCII
    • 0x80-0x9F holds control characters
    • 0xA0-0xFF holds printable character symbols
  • Latin1 is an alias for ISO-8859-1, in some environments written Latin-1

Let's take a look at the format logic that converts the parameter object into the header string:

/**
 * Format an object into a Content-Disposition HTTP header string
 *
 * @param {object} obj
 * @param {string} obj.type
 * @param {object} [obj.parameters]
 * @return {string}
 * @private
 */

function format (obj) {
  // Parameter object
  var parameters = obj.parameters
  // type
  var type = obj.type

  if (!type || typeof type !== 'string' || !TOKEN_REGEXP.test(type)) {
    throw new TypeError('invalid type')
  }

  // Start the Content-Disposition string with the type,
  // e.g. attachment
  var string = String(type).toLowerCase()

  // Append the key/value parts of Content-Disposition
  if (parameters && typeof parameters === 'object') {
    var param
    var params = Object.keys(parameters).sort()

    for (var i = 0; i < params.length; i++) {
      param = params[i]

      // Value logic:
      // - if the key has the key* form, encode the value into the RFC 5987 extended format
      // - otherwise escape the double quotes in the value and wrap it in double quotes
      var val = param.substr(-1) === '*'
        ? ustring(parameters[param])
        : qstring(parameters[param])

      // Build the 'attachment; filename="a.txt"' style string
      string += '; ' + param + '=' + val
    }
  }

  return string
}

The format logic is as follows:

  • Iterate over all keys of the parameter object
  • Perform different string-concatenation logic depending on the kind of key
    • If the key ends with *, for example filename*, call ustring to encode the value into the RFC 5987 format
    • Otherwise call qstring, which only escapes the value and wraps it in double quotes
  • Finally assemble the usual format, such as attachment; filename=demo.txt

Let’s look at how ustring translates value to RFC 5987:

/**
 * RegExp that matches the special URL characters left after encodeURIComponent,
 * excluding the percent sign
 * @private
 */
var ENCODE_URL_ATTR_CHAR_REGEXP = /[\x00-\x20"'()*,/:;<=>?@[\\\]{}\x7f]/g

/**
 * Percent-encode a Unicode string for HTTP (RFC 5987)
 *
 * @see https://datatracker.ietf.org/doc/html/rfc5987
 * @param {string} val
 * @return {string}
 * @private
 */

function ustring (val) {
  var str = String(val)

  // percent encode as UTF-8
  // First encode str with encodeURIComponent, then percent-encode the remaining
  // special characters (ENCODE_URL_ATTR_CHAR_REGEXP matches special URL characters except %)
  var encoded = encodeURIComponent(str)
    .replace(ENCODE_URL_ATTR_CHAR_REGEXP, pencode)

  return 'UTF-8\'\'' + encoded
}

Here you can see that encodeURIComponent is applied first, and then the special URL characters other than % are percent-encoded into their hexadecimal form. Next let's look at how qstring handles the value:

/**
 * RegExp that matches backslashes and double quotes
 * @private
 */
var QUOTE_REGEXP = /([\\"])/g

/**
 * Quote a string for HTTP
 *
 * @param {string} val
 * @return {string}
 * @private
 */
function qstring (val) {
  var str = String(val)

  // Escape every backslash or double quote matched by QUOTE_REGEXP
  // by prefixing it with a backslash ('\\$1' produces \" or \\),
  // then wrap the whole string in double quotes
  return '"' + str.replace(QUOTE_REGEXP, '\\$1') + '"'
}
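A quick, hedged sanity check of this quoting behaviour through the public API (the filename is made up for illustration; the output is what the logic above should produce):

const contentDisposition = require('content-disposition');

// Double quotes inside the name are escaped by qstring before the value is wrapped in quotes
contentDisposition('the "best" name.txt');
// => attachment; filename="the \"best\" name.txt"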

Content-Disposition reverse-parsing principle

Since a Content-Disposition value can be generated, it can also be resolved in reverse back into the type and the parameter object. In the content-disposition library this is done through the parse logic:

const pdfPath = __dirname + '/demo.pdf';
const params = contentDisposition.parse(
  contentDisposition(pdfPath),
);

Let’s take a look at how the parse function reverse-parses:

/**
 * Parse a Content-Disposition header string
 *
 * @param {string} string
 * @return {object}
 * @public
 */

function parse (string) {
  if (!string || typeof string !== 'string') {
    throw new TypeError('argument string is required')
  }

  // The match includes the type plus any trailing spaces and semicolon
  var match = DISPOSITION_TYPE_REGEXP.exec(string)

  if (!match) {
    throw new TypeError('invalid type format')
  }

  // normalize type
  var index = match[0].length
  // The first capture group of match is the type value
  var type = match[1].toLowerCase()

  var key
  var names = []
  var params = {}
  var value

  // calculate index to start at
  // Parameters start right after the matched type
  index = PARAM_REGEXP.lastIndex = match[0].substr(-1) === ';'
    ? index - 1
    : index

  // Match the subsequent key=value parameters with the regex
  while ((match = PARAM_REGEXP.exec(string))) {
    if (match.index !== index) {
      throw new TypeError('invalid parameter format')
    }

    index += match[0].length
    // get the key
    key = match[1].toLowerCase()
    // get the value
    value = match[2]

    if (names.indexOf(key) !== -1) {
      throw new TypeError('invalid duplicate parameter')
    }

    names.push(key)

    // If the key ends with *, for example filename*
    if (key.indexOf('*') + 1 === key.length) {
      // decode extended value
      // keep the part of the key before the *
      key = key.slice(0, -1)
      // decode the value
      value = decodefield(value)

      // overwrite existing value
      params[key] = value
      continue
    }

    // Ignore a key that already exists
    if (typeof params[key] === 'string') {
      continue
    }

    // If the value is wrapped in double quotes
    if (value[0] === '"') {
      // remove quotes and escapes
      // strip the surrounding double quotes and unescape the escaped characters inside
      value = value
        .substr(1, value.length - 2)
        .replace(QESC_REGEXP, '$1')
    }

    params[key] = value
  }

  if (index !== -1 && index !== string.length) {
    throw new TypeError('invalid parameter format')
  }

  return new ContentDisposition(type, params)
}

The parse logic is as follows:

  • Use a regex to match the type part of the string
  • Starting right after the type, use another regex to match the subsequent ; key=value parameters one by one
  • Strip the trailing * from matched keys such as filename*
  • Strip the surrounding double quotes from matched values and decode the values themselves
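A small, hedged sketch of the result, assuming the header string generated earlier for demoŸ.pdf:

const contentDisposition = require('content-disposition');

contentDisposition.parse(
  'attachment; filename="demo?.pdf"; filename*=UTF-8\'\'demo%C5%B8.pdf'
);
// => ContentDisposition {
//      type: 'attachment',
//      parameters: { filename: 'demoŸ.pdf' }  // filename* overrides the plain filename
//    }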

Let's look at the regex object used to match the type:

/**
 * RegExp for various RFC 6266 grammar
 *
 * disposition-type = "inline" | "attachment" | disp-ext-type
 * disp-ext-type    = token
 * disposition-parm = filename-parm | disp-ext-parm
 * filename-parm    = "filename" "=" value
 *                  | "filename*" "=" ext-value
 * disp-ext-parm    = token "=" value
 *                  | ext-token "=" ext-value
 * ext-token        = <the characters in token, followed by "*">
 * @private
 */
/**
 * Matches the Content-Disposition type field:
 * - the capture group matches 1-n token characters
 * - followed by 0-n tabs or spaces
 * - followed by either a semicolon or the end of the string
 */
var DISPOSITION_TYPE_REGEXP = /^([!#$%&'*+.0-9A-Z^_`a-z|~-]+)[\x09\x20]*(?:$|;)/

The regex is written directly from the RFC 6266 grammar above, and its logic is:

  • The capture group matches 1-n token characters
  • Followed by 0-n tab characters or spaces
  • Then match either a semicolon or the end of the string
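As a quick sketch (not part of the library itself) of what this regex extracts from a header string:

var DISPOSITION_TYPE_REGEXP = /^([!#$%&'*+.0-9A-Z^_`a-z|~-]+)[\x09\x20]*(?:$|;)/

var match = DISPOSITION_TYPE_REGEXP.exec('attachment; filename="a.txt"')
match[0] // 'attachment;'  the whole match, including the trailing semicolon
match[1] // 'attachment'   the capture group, i.e. the type value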

There is a similar regex that matches the subsequent ; key=value parameters, written according to the RFC 2616 grammar:

/**
 * RegExp for the RFC 2616 grammar
 *
 * parameter     = token "=" ( token | quoted-string )
 * token         = 1*<any CHAR except CTLs or separators>
 * separators    = "(" | ")" | "<" | ">" | "@"
 *               | "," | ";" | ":" | "\" | <">
 *               | "/" | "[" | "]" | "?" | "="
 *               | "{" | "}" | SP | HT
 * quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
 * qdtext        = <any TEXT except <">>
 * quoted-pair   = "\" CHAR
 * CHAR          = <any US-ASCII character (octets 0 - 127)>
 * TEXT          = <any OCTET except CTLs, but including LWS>
 * LWS           = [CRLF] 1*( SP | HT )
 * CRLF          = CR LF
 * CR            = <US-ASCII CR, carriage return (13)>
 * LF            = <US-ASCII LF, linefeed (10)>
 * SP            = <US-ASCII SP, space (32)>
 * HT            = <US-ASCII HT, horizontal-tab (9)>
 * CTL           = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
 * OCTET         = <any 8-bit sequence of data>
 * @private
 */
var PARAM_REGEXP = /;[\x09\x20]*([!#$%&'*+.0-9A-Z^_`a-z|~-]+)[\x09\x20]*=[\x09\x20]*("(?:[\x20!\x23-\x5b\x5d-\x7e\x80-\xff]|\\[\x20-\x7e])*"|[!#$%&'*+.0-9A-Z^_`a-z|~-]+)[\x09\x20]*/g

More about the standard can be found in the RFC 2616 document at datatracker.ietf.org/doc/html/rfc2616

Finally, let's take a look at how decodefield decodes the value:

/**
 * Decode an RFC 5987 field value
 *
 * @param {string} str
 * @return {string}
 * @private
 */

function decodefield (str) {
  var match = EXT_VALUE_REGEXP.exec(str)

  if (!match) {
    throw new TypeError('invalid extended field value')
  }

  var charset = match[1].toLowerCase()
  var encoded = match[2]
  var value

  // to binary string
  // Turn the percent-encoded hexadecimal bytes back into single characters
  var binary = encoded.replace(HEX_ESCAPE_REPLACE_REGEXP, pdecode)

  switch (charset) {
    case 'iso-8859-1':
      value = getlatin1(binary)
      break
    case 'utf-8':
      value = Buffer.from(binary, 'binary').toString('utf8')
      break
    default:
      throw new TypeError('unsupported charset in extended field')
  }

  return value
}

Take a look at the regular expression that matches the extended value:

/**
 * RegExp for the RFC 5987 extended value grammar
 *
 * ext-value     = charset  "'" [ language ] "'" value-chars
 * charset       = "UTF-8" / "ISO-8859-1" / mime-charset
 * mime-charset  = 1*mime-charsetc
 * mime-charsetc = ALPHA / DIGIT
 *               / "!" / "#" / "$" / "%" / "&"
 *               / "+" / "-" / "^" / "_" / "`"
 *               / "{" / "}" / "~"
 * language      = ( 2*3ALPHA [ extlang ] )
 *               / 4ALPHA
 *               / 5*8ALPHA
 * extlang       = *3( "-" 3ALPHA )
 * value-chars   = *( pct-encoded / attr-char )
 * pct-encoded   = "%" HEXDIG HEXDIG
 * attr-char     = ALPHA / DIGIT
 *               / "!" / "#" / "$" / "&" / "+" / "-" / "."
 *               / "^" / "_" / "`" / "|" / "~"
 * @private
 */

var EXT_VALUE_REGEXP = /^([A-Za-z0-9!#$%&+\-^_`{}~]+)'(?:[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}|[A-Za-z]{4,8}|)'((?:%[0-9A-Fa-f]{2}|[A-Za-z0-9!#$&+.^_`|~-])+)$/
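Continuing with the regex just defined, a small sketch of what it pulls out of an extended value:

var match = EXT_VALUE_REGEXP.exec("UTF-8''demo%C5%B8.pdf")
match[1] // 'UTF-8'           the charset
match[2] // 'demo%C5%B8.pdf'  the percent-encoded value, which decodefield then decodes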

More about RFC 5987 can be found in the document at datatracker.ietf.org/doc/html/rfc5987

References

  • Unicode character set list: www.tamasoft.co.jp/en/general-…
  • ASCII table: www.habaijian.com/
  • RFC 5987 standard: datatracker.ietf.org/doc/html/rf…
  • Baidu Baike, ISO-8859-1: baike.baidu.com/item/ISO-88…
  • Talk about the encodings in computers (Unicode, GBK, ASCII, UTF8, UTF16, ISO8859-1, etc.) and the solutions to garbled-text problems

I am Leng Hammer, and the story between me and the front end continues……