
Preface

I recently came across a question about Tomcat in another community that I found quite interesting. I had never thought about it before, so today I will work through the "why" using Tomcat's internals. This article spends a fair amount of space on the HTTP file upload standard and on how Tomcat handles requests, and much of that is fairly basic; readers who do not need the background can skip straight to the end.

File upload in HTTP

HTTP is a text protocol, so how does a text protocol transfer files?

It sends them directly. Yes, it is that simple. HTTP is a text protocol only at the application layer; at the transport layer everything is bytes, so there is no difference and no extra encoding or decoding is needed.

multipart/form-data

HTTP provides form-based file upload (RFC 1867, "Form-based File Upload in HTML"). Set the ENCTYPE attribute of the form to multipart/form-data and add an INPUT tag of type file.

 <FORM ENCTYPE="multipart/form-data" ACTION="_URL_" METHOD=POST>

   File to process: <INPUT NAME="userfile1" TYPE="file">

   <INPUT TYPE="submit" VALUE="Send File">

 </FORM>

A multipart/form-data form differs from the default application/x-www-form-urlencoded form. Both can carry multiple fields, but the former can also upload files, while the latter can only carry text.

In a simple multipart/form-data request, the HTTP header section looks the same as usual, except for an extra boundary parameter in the Content-Type, while the payload is completely different.

The boundary separates the fields of the form in a multipart/form-data payload. The payload starts and ends with a boundary line, and a boundary line also sits between each pair of fields (part/item). Each delimiter line is "--" followed by the boundary, and the closing line adds a trailing "--".

When reading, the server only needs to take the boundary from the Content-Type and split the payload on it to obtain all the fields.

Each field carries its own small header section, which contains a Content-Disposition header. It records the name of the field and, for a file, a filename attribute; a file part also has a Content-Type header on the next line that identifies the file type.
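The splitting described above can be sketched in a few lines of Java. This is only an illustration for text-only payloads, not how Tomcat parses multipart; the class name, the boundary XYZ, and the sample field values are made up for the example, and quoted boundary values are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultipartSketch {
    // Matches the field name, but not the "name" inside "filename".
    private static final Pattern NAME = Pattern.compile("\\bname=\"([^\"]*)\"");

    /** Pulls the boundary parameter out of a Content-Type header value. */
    static String boundaryOf(String contentType) {
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.startsWith("boundary=")) {
                return p.substring("boundary=".length());
            }
        }
        throw new IllegalArgumentException("no boundary in: " + contentType);
    }

    /** Splits a text-only multipart payload into field name -> content. */
    static Map<String, String> parse(String contentType, String payload) {
        String delimiter = "--" + boundaryOf(contentType);
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : payload.split(Pattern.quote(delimiter))) {
            // Skip the empty preamble and the closing "--" marker.
            if (part.isEmpty() || part.startsWith("--")) {
                continue;
            }
            // A blank line separates the part headers from the part content.
            int headerEnd = part.indexOf("\r\n\r\n");
            if (headerEnd < 0) {
                continue;
            }
            Matcher m = NAME.matcher(part.substring(0, headerEnd));
            if (!m.find()) {
                continue;
            }
            String content = part.substring(headerEnd + 4);
            if (content.endsWith("\r\n")) {
                // Drop the CRLF that precedes the next delimiter line.
                content = content.substring(0, content.length() - 2);
            }
            fields.put(m.group(1), content);
        }
        return fields;
    }

    /** Hypothetical request body with one text field and one file field. */
    static String samplePayload() {
        return "--XYZ\r\n"
            + "Content-Disposition: form-data; name=\"title\"\r\n"
            + "\r\n"
            + "hello\r\n"
            + "--XYZ\r\n"
            + "Content-Disposition: form-data; name=\"userfile1\"; filename=\"a.txt\"\r\n"
            + "Content-Type: text/plain\r\n"
            + "\r\n"
            + "file body\r\n"
            + "--XYZ--\r\n";
    }
}
```

Calling MultipartSketch.parse("multipart/form-data; boundary=XYZ", MultipartSketch.samplePayload()) yields the two fields title and userfile1. A real parser works on bytes and streams rather than on one in-memory string.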

Although both x-www-form-urlencoded and multipart forms can carry fields, multipart can carry files as well as text fields. Multipart file transfer is also the "standard" way, so servers generally support it and can read the files directly.

x-www-form-urlencoded can only carry basic text data. Nobody can stop you from forcing a file through as text, but the back end will then parse it as a string, which adds an unnecessary byte-to-string decoding cost and can lead to encoding errors.

An x-www-form-urlencoded message has no boundary: the fields are joined with the & symbol, and both keys and values are URL-encoded. Although this adds an encoding step, there are no per-field headers and no boundary, so for the same text fields the message is much smaller than its multipart equivalent.
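As a rough illustration of that encoding, the following sketch builds an x-www-form-urlencoded body with the JDK's URLEncoder. The class name and the sample field names are invented for the example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

class FormEncodingSketch {
    /** Encodes fields as key=value pairs joined by '&', URL-encoding both sides. */
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }
}
```

For the fields name = "Tom Cat" and note = "a&b=c" (in that order) this produces name=Tom+Cat&note=a%26b%3Dc: the separators inside the values are escaped, so the server can split on & and = safely.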

Besides multipart, there is another, less common way to upload a file directly.

Binary payload

The other upload method is what I call a binary payload. The name is my own; I could not find a description of it in the HTTP specification (please post a link if you find one), but many HTTP clients support it.

Postman is one example, and OkHttp is another:

import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

OkHttpClient client = new OkHttpClient().newBuilder()
  .build();
MediaType mediaType = MediaType.parse("image/png");
// The request body is the raw file content, with no multipart framing
RequestBody body = RequestBody.create(mediaType, "<file contents here>");
Request request = new Request.Builder()
  .url("http://localhost:8098/upload")
  .method("POST", body)
  .addHeader("Content-Type", "image/png")
  .build();
Response response = client.newCall(request).execute();

This approach is very simple: the file data is the entire payload. It is easy to implement on the client, but server-side support is poor. Tomcat, for example, does not treat such a body as a file, only as an ordinary request body.
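For readers without OkHttp, roughly the same request can be built with the JDK's own java.net.http API. This is a sketch only: it builds the request without sending it, the class name is invented, and the URL is the one from the OkHttp snippet with an http:// scheme assumed.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class BinaryUploadSketch {
    /** Builds a POST whose body is exactly the given bytes, with no multipart framing. */
    static HttpRequest build(byte[] fileBytes) {
        return HttpRequest.newBuilder(URI.create("http://localhost:8098/upload"))
            // The Content-Type describes the file itself, not a form.
            .header("Content-Type", "image/png")
            .POST(HttpRequest.BodyPublishers.ofByteArray(fileBytes))
            .build();
    }
}
```

The Content-Length of such a request equals the file size, because nothing else is in the payload.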

Analysis of Tomcat processing mechanism

When Tomcat processes a text request, it first reads the headers and parses Content-Length to determine where the message ends. The rest of the payload is not read all at once but is wrapped in an InputStream, which internally calls read on the socket to pull data from RCV_BUF (a complete message may be larger than the read buffer).

Only when the application calls something that reads the content, such as getParameter or getInputStream, does that InputStream read from the socket's RCV_BUF and fetch the payload data.

In other words, Tomcat does not read all the data up front and hold it in memory. It wraps an InputStream over RCV_BUF, and each application-layer read on ServletRequest#getInputStream is forwarded to a read on the socket's RCV_BUF.

If the application layer then reads the whole ServletRequest#getInputStream, converts it to a string, and keeps it in memory, that is the application's choice and has nothing to do with Tomcat.
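The idea of pulling data only when the application reads can be illustrated with a small wrapper stream. This is a model, not Tomcat code: a ByteArrayInputStream stands in for the socket's RCV_BUF, and the class names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class LazyBodySketch {
    /** Wraps a source stream and counts how many bytes have been pulled from it. */
    static class CountingStream extends FilterInputStream {
        long pulled = 0;

        CountingStream(InputStream source) {
            super(source);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                pulled++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                pulled += n;
            }
            return n;
        }
    }

    /** Stand-in for a request body waiting in the receive buffer: 64 KB of bytes. */
    static CountingStream newBody() {
        return new CountingStream(new ByteArrayInputStream(new byte[64 * 1024]));
    }
}
```

Right after newBody() the counter is 0. Reading one 8192-byte chunk moves it to 8192 and leaves the rest of the payload untouched, which is the behavior described above for the request InputStream.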

Tomcat handles multipart requests differently. Because multipart is designed for transferring files, Tomcat introduces temporary files for this type of request and writes the multipart data to disk while parsing the message.

Tomcat wraps each field in a DiskFileItem (org.apache.tomcat.util.http.fileupload.disk.DiskFileItem); this class does not distinguish between file and text fields. A DiskFileItem has a header part and a content part. Content up to sizeThreshold is kept in memory and anything beyond that goes to disk. However, this value defaults to 0, so by default all content is written to disk. What is written to disk must later be read back from disk, which is naturally slower. So if you are only sending text, do not use multipart, because the data will take a round trip through the disk.
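The effect of sizeThreshold can be sketched with a small output stream that keeps data in memory until the threshold is crossed and then spills to a temporary file. This is an illustration of the idea only; the class and method names are hypothetical and Tomcat's own classes are not reproduced here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class ThresholdBufferSketch extends OutputStream {
    private final int sizeThreshold;
    private final ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream disk;   // non-null once the content has spilled to disk
    private Path tempFile;

    ThresholdBufferSketch(int sizeThreshold) {
        this.sizeThreshold = sizeThreshold;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        if (disk == null && memory.size() + len > sizeThreshold) {
            // Crossing the threshold: move what is buffered to a temp file.
            tempFile = Files.createTempFile("upload_", ".tmp");
            disk = Files.newOutputStream(tempFile);
            memory.writeTo(disk);
            memory.reset();
        }
        if (disk != null) {
            disk.write(buf, off, len);
        } else {
            memory.write(buf, off, len);
        }
    }

    @Override
    public void close() throws IOException {
        if (disk != null) {
            disk.close();
        }
    }

    boolean isInMemory() {
        return disk == null;
    }

    /** Path of the temp file, or null while the content is still in memory. */
    Path tempFile() {
        return tempFile;
    }
}
```

With a threshold of 1024, writing "hello" stays in memory; with a threshold of 0, the very first byte already goes to a temporary file, which matches the default behavior described above.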

When processing a multipart request, Tomcat adds the key/value of every non-file field to parameterMap. In other words, these non-file fields can be obtained through request.getParameter/getParameterMap.

// org.apache.catalina.connector.Request#parseParts
if (part.getSubmittedFileName() == null) {
    String name = part.getName();
    String value = null;
    try {
        value = part.getString(charset.name());
    } catch (UnsupportedEncodingException uee) {
        // Not possible
    }
    ...
    parameters.addParameter(name, value);
}

getParameter only returns form parameters (FormParam) and query-string parameters (QueryString), but a multipart request is also a form, so its text fields qualify.

A simple summary

Tomcat handles different types of requests:

  1. For a GET with a query string, all the parameters are in the request line, which is read into memory at once together with the headers
  2. For a POST, Tomcat reads only the headers, not the payload. Instead, it wraps the socket in an InputStream for the application layer to read
    1. x-www-form-urlencoded messages are not read proactively, but many web frameworks (such as Spring MVC) call getParameter, which starts reading the InputStream and therefore RCV_BUF
    2. For other non-multipart bodies, Tomcat does not initiate a read either. The application layer calls ServletRequest#getInputStream to read the RCV_BUF data
    3. Multipart messages are not read proactively either. Only HttpServletRequest#getParts triggers parsing/reading. Many web frameworks call getParts, so parsing is triggered; the parts are written to temporary files and wrapped in InputStreams for the application layer to read

If the application layer does not read RCV_BUF (in time), the buffer fills up with received data. The receiver then advertises a zero window, so the client stops sending and its pending data piles up in SND_BUF. Once SND_BUF is full as well, writes on the client block and the connection stalls.
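The stall can be modeled without real sockets. In this toy sketch a bounded queue plays the role of RCV_BUF; it is an analogy for TCP flow control, not an implementation of it, and all names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;

class BackpressureSketch {
    /** Receive buffer modeled as a bounded queue of chunks. */
    private final ArrayBlockingQueue<byte[]> rcvBuf;

    BackpressureSketch(int capacityInChunks) {
        this.rcvBuf = new ArrayBlockingQueue<>(capacityInChunks);
    }

    /** Sender side: returns false when the buffer is full, i.e. the sender has to wait. */
    boolean trySend(byte[] chunk) {
        return rcvBuf.offer(chunk);
    }

    /** Application side: consumes one chunk, or returns null if nothing is buffered. */
    byte[] read() {
        return rcvBuf.poll();
    }
}
```

With a capacity of two chunks, the third trySend fails until the application calls read() once. The sender can only make progress as fast as the application drains the buffer.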

The reasoning below is my personal opinion and is not backed by official documentation. If you see it differently, please leave a comment.

Multipart is usually used to transfer files, and files are usually much larger than the socket buffer. So, to avoid stalling the TCP connection, Tomcat reads the full payload in one go and stores all the parts on disk (headers in memory, content on disk).

The application layer then only needs to read the part data from the DiskFileItem that Tomcat provides, so the data in RCV_BUF is consumed promptly.

In terms of efficiency, staging the data on disk is certainly much slower than not staging it. But it lets RCV_BUF be drained in time, so the TCP connection does not stall.

With HTTP/2 multiplexing, several requests share one TCP connection, so if RCV_BUF is not drained in time, every "logical HTTP connection" (stream) on it stalls.

Why aren't other types of messages staged on disk?

Because they are small. Ordinary requests are not very large, typically a few KB to a few tens of KB. Plain-text requests are also read promptly and all in one go. Multipart requests are different: they mix text and files and may contain several files.

Suppose that, after receiving a file, the server needs to forward it to a cloud vendor's object storage service. There are two ways to do this:

  1. Receive the complete file, hold it in memory, and then call the object storage SDK
  2. Stream it: read ServletRequest#getInputStream and write to the SDK's OutputStream

With option 1, RCV_BUF is read in time, but the memory footprint is too large, which is unreasonable. With option 2, memory use is small, but RCV_BUF cannot be drained in time because there is a network hop on both sides: each read has to wait for the outbound write.
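The streaming option can be sketched as a plain copy loop with a fixed buffer. This is a generic illustration with invented names; a real object storage SDK will have its own upload API, which is not shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamRelaySketch {
    /** Copies source to sink through a fixed 8 KB buffer; returns the bytes copied. */
    static long relay(InputStream source, OutputStream sink) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = source.read(buffer)) != -1) {
            // This write may block on the outbound network, which delays
            // the next read from the inbound side.
            sink.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}
```

Memory use stays at the size of the buffer no matter how large the file is, but the inbound read rate is tied to how fast the sink accepts data. That coupling is why the receive buffer may not be drained in time.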

Tomcat is not alone here: Jetty handles multipart the same way. I have not looked at other web servers, but I expect they do the same.

References

  • Apache Tomcat
  • Form-based File Upload in HTML – IETF
  • Analysis of Tomcat architecture by Guangrui Liu

Original content; unauthorized reprinting is prohibited.