Welcome to “Algorithms and the Beauty of Programming” ↑ pay attention to us!

This article was first published on the wechat official account “Beauty of Algorithms and Programming”. Welcome to follow and learn more about this series of blogs in time.

Some of the descriptions in this article may seem wordy to some, but if you want to get straight to the point, read the “Source Code Analysis” section. But this is the record of his own analysis to solve the problem step by step way of thinking, while thinking also is not very deep in some places, mainly due to the time is not very abundance (although spent three days, but the feeling is not enough), I will in the subsequent post, combined with the actual problems or see other people’s questions in the BBS, Step by step in-depth analysis of Tomcat source code with problems, this way of source code analysis with problems, more targeted, not to let oneself lost in the sea of source code. If you have any suggestions on blog format or other aspects, please kindly point them out, thank you very much.

The objectives of this source code analysis are:

Clear org. Apache. Catalina. Conector. RequestFacade: : getQueryString () and the getParameter (String) of the differences and their respective implementation, achieve this goal is to complete the task.

The introduction

The problem is due to the derivation of the recently saw a post on oschina www.oschina.net/question/82… .

Problem analysis

Hashmap Destroy Encoding is the cause of this problem, so we googled hashMap Encoding to get a more relevant answer

Stackoverflow.com/questions/8… The situation in this post is also strange.

The function of the program is described as follows: read A set of space-delimited strings from the file A, and then write these strings line by line to another file B. For example, the format of file A is Aaa BBBBB cdefGGg… . File B is in the format of Aaa Bbbbb Cdefgggg… .

Program code:

final StringBuffer fileData = new StringBuffer(1000); final BufferedReader reader = new BufferedReader( new FileReader("fileIn.txt")); char[] buf = new char[1024]; int numRead = 0; while ((numRead = reader.read(buf)) ! = -1) { final String readData = String.valueOf(buf, 0, numRead); fileData.append(readData); buf = new char[1024]; } reader.close(); String mergedContent = fileData.toString(); mergedContent = mergedContent.replaceAll("\\<.*? > ", ""); mergedContent = mergedContent.replaceAll("\\r\\n|\\r|\\n", " "); final BufferedWriter out = new BufferedWriter( new OutputStreamWriter( new FileOutputStream("fileOut.txt"))); final HashMap<String, String> wordsMap = new HashMap<String, String>(); final String test[] = mergedContent.split(" "); for (final String string : test) { wordsMap.put(string, string); } for (final String string : wordsMap.values()) { out.write(string + "\n"); } out.close();Copy the code

In this case, the contents of file B are found to be garbled, whereas if you change some of the code in the above program to the following, you will get the desired result.

. for (final String string : test) { out.write(string + "\n"); //wordsMap.put(string, string); } //for (final String string : wordsMap.values()) //{ // out.write(string + "\n"); //} out.close();Copy the code

I don’t quite understand the reason for the occurrence of this situation. The answers to the post in the original text have nothing to do with the question. Most people are talking about how to solve this problem without mentioning the reasons for the occurrence of the above situation.

After reading this post and other related posts, I found that the problem in the introduction seems to have nothing to do with hashMap encoding, and there may be other reasons, so I wrote a simple servlet to practice.

practice

To start with the problem recurrence, I wrote a simple servlet that looks like this:

/ / the requested url is: http://localhost:8080/demo/1.do? Addr = Shanghai

@Override
	protected void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws ServletException, IOException {
		//System.out.println(req);
		System.out.println("Request::getParameter(addr) is: "+ req.getParameter("addr"));
		String queryString = req.getQueryString();
		System.out.println("queryString is: "+queryString);
		String[]  params = queryString.split("[=]");
		
		Map<String, String> map = new HashMap<String, String>();
		
		map.put(params[0], params[1]);
		System.out.println("Map::get(addr) is: "+map.get(params[0]));
		return;
	}
Copy the code

When run, the result is:

GetParameter () yields garbled values, while the values parsed by getQueryString() and stored in the map are UTF-8 encoded.

GetParameter () is garble, this is obvious because the default urlencoding in browsers is usually UTF-8, and the default URIEncoding in Tomcat is ISO-8859-1, not UTF-8. When a request from a client arrives at Tomcat, tomcat uses a different encoding to decode UTF-8, which naturally results in garbled code (for details on how Tomcat handles queryString, read the source code analysis section below). So the solution is to add the following configuration (URIEncoding=” UTF-8 “) to the Tomcat configuration file server.xml:

<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding=" UTF-8 "/>Copy the code

 

By modifying the configuration file above, we get the following test results:

According to the above analysis, it can be concluded that the value of getParameter() is obtained by decode according to the URIEncoding set in Tomcat, while tomcat does not decode getQueryString(). The original urlencoding method is retained.

At this point, we can basically speculate that the reason for the occurrence of the situation in the introduction is:

The URL encoding method of the CLIENT for THE HTTP GET request is inconsistent with the URIEncoding defined in Tomcat. As a result, the Tomcat server uses another decoding method to decode the CLIENT URL, which inevitably leads to Chinese garbled characters. Why is there no garbled string in the Map? The reason is that getQueryString() does not decode the client URL, so it retains the original CLIENT UTF-8 encoding, so if it is decoded using UTF-8, there will be no Chinese garbled phenomenon.

Source code analysis

After the above practice, you can basically determine the cause of the problem, but for further verification, I tried to analyze how Tomcat handles getParameter() and getQueryString() differently.

Since HttpServletRequest is an interface, we can’t see the implementation of getParameter() and getQueryString(), so we first need to determine what the implementation class of Request is. We add the following code to the servlet:

System.out.pritnln(req);



Through the printing results, we can see the specific implementation class is org. Apache. Catalina. The RequestFacade, so the next step of our work is a concrete analysis of this class is how to deal with, also is to analyze two functions of processing flow, One is RequestFacade: : getQueryString (), another is RequestFacade: : the getParameter (String).

The first step is to obtain the source code of Tomcat. The usual approach is to clone the remote Git library in Eclipse through the EGit plug-in, and then import the project.

With all the preparations in place, the next step is the specific source code analysis:

From the org. Apache. Catalina. Connector. RequestFacade this class, we can see that this is a used fa&ccedil; Ade mode of packaging class, so we need to first understand fa&CCEDIl; Knowledge of ADE pattern.

Facade**** This section describes the mode

At the core of the facade pattern is to provide a unified interface for a set of interfaces to a subsystem so that clients can use the subsystem without having to care about the implementation of the subsystem.

Application of facade design pattern:

1. The original class provides a lot of complicated interfaces, but we only need to use part of its interfaces;

2. The interface provided by the original class can only partially meet our needs, and we do not want to rewrite a new class to replace the original class;

.

In this article, RequestFacade is an encapsulation of Request. Since Request itself provides many interfaces, only part of its functions need to be used in this system. In the actual analysis, We found that the specific work of the Request was eventually delegated to the underlying coyote.request.

RequestFacade::getQueryString() Analysis of the

How to read and analyze source code? My general approach is to analyze the normal processing logic first, so that I can quickly understand the overall architecture or key processes by ignoring the logs, error handling, variable definitions, etc.

Based on the above ideas, the processing process is as follows:

-RequestFacade::getQueryString()

         -Request::getQueryString()

                   -org.apache.coyote.Request::queryString()::toString()
Copy the code

 

As you can see from the above analysis, the processing process is relatively simple, with the delegate step by step and the coYote.request actually doing the work, so we only need to analyze how the class is handled.

Related function source code is as follows:

org.apache.catalina.connector.RequestFacade::getQueryString()

@Override
    public String getQueryString() {
        if (request == null) {
            throw new IllegalStateException(
                            sm.getString("requestFacade.nullRequest"));
        }
        return request.getQueryString();
    }
Copy the code

 

org.apache.catalina.connector.Request::getQueryString()

/**

     * Return the query string associated with this request.

     */

    @Override
    public String getQueryString() {
        return coyoteRequest.queryString().toString();
    }
Copy the code

 

org.apache.coyote.Request::queryString()

public MessageBytes queryString() {
        return queryMB;
    }
Copy the code

 

Coyote. Request: : the queryString () work is very simple, is only part of the return type for MessageBytes queryMB field, but when this field is the assignment? This is a very important question to figure out, because it is very likely that decode will take place before assignment.

QueryMB **** assignment analysis

When is queryMB assigned?

QueryMB is org. Apache. Coyote. Request a private member variables, the data type for MessageBytes, defined as follows:

private MessageBytes queryMB = MessageBytes.newInstance();
Copy the code

 

How do we figure out when the queryMB variable is assigned? In Eclipse, right-click queryMB and select Open Call Hierarchy to see where queryMB is called, as shown in the screenshot below:

As you can see from the figure above, there are three places where queryMB is called:

public MessageBytes queryString() {
        return queryMB;
}
Copy the code

 

The function gets a queryMB object, and once you get it, it’s possible to do something with it like assign.

Public void recycle() {... . queryMB.recycle(); ... .}Copy the code

 

As the name implies, querymb.recycle () is a queryMB that is recycled from a queryMB.

public Request() {
        parameters.setQuery(queryMB);
        parameters.setURLDecoder(urlDecoder);
}
Copy the code

 

The Request() constructor assigns a value to its member variable parameters, independent of the queryMB assignment.

Based on the above analysis of the three cases, we conclude that assignment is most likely to occur in the first case, so we’ll continue to analyze which functions queryString() is called by, as shown below:

ParseRequestLine (Boolean) is the function that is most likely to receive a value for queryString(). For example, parseRequestLine(Boolean) is a function that parses the request line.

/** * Read the request line. This function is meant to be used during the * HTTP request header parsing. Do NOT attempt to read the request body * using it. * * @throws IOException If an exception occurs during the underlying socket * read operations, or if the given buffer is not big enough to accommodate * the whole line. * @return true if data is properly fed; false if no data is available * immediately and thread should be freed */ @Override public boolean ParseRequestLine (Boolean useAvailableData) throws IOException {... . if (questionPos >= 0) { request.queryString().setBytes(buf, questionPos + 1, end - questionPos - 1); request.requestURI().setBytes(buf, start, questionPos - start); } else { request.requestURI().setBytes(buf, start, end - start); }... .}Copy the code

 

From the above code, we can see that when parsing the HTTP Request line, we do operate on queryMB, getting the bytes directly from the InputBuffer and assigning to queryMB.

request.queryString().setBytes(buf, questionPos + 1,
                                           end - questionPos - 1);
Copy the code

 

GetQueryString ()

Request::queryMB() returns coyote.request ::queryMB(), which directly parses the HTTP Request line. And assigns the resulting byte array directly to Coyote. Request::queryMB.

(in the first RequestFacade invokes the getQueryString (), and then passed on to the Request: : getQueryString (), Call getQueryString() to return the MessageBytes queryMB field value stored in the object. QueryMB is used to parse the HTTP Request line. Get the raw bytes information directly and store it in queryMB. At this point, getQueryString() returns the Http Request line byte information without any processing from the upper layer.

RequestFacade: : the getParameter () analysis

(We know that in Web development, HTTP GET request and HTTP POST request are mostly processed. For GET request, we can get it directly from URL through getParameter() method, but for POST request, there will be requsetBody. So how does Tomcat handle that? See subsequent blog analysis)

GetQueryString () = getParameter(); getParameter() = getParameter();

-RequestFacade::getParameter(String) -Request::getParameter(String) -Request::parseParameters() -coyote.Request::getParameters() -Parameters::setLimit(int) -Parameters::setEncoding(String) -Parameters::handleQueryParameters() -decodedQuery.duplicate(MessageBytes) -Parameters::processParameters(MessageBytes, String) - the Parameters: : processParameters (byte [], int, int, String) -coyote.Request::getParameters()::getParameter(String) -Parameters::paramHashValues.get(String)Copy the code

 

Appendix, the concrete realization of each function source code, students in need can be in here for my.oschina.net/gschen/blog… .

From the above process, we can see that, in the end, the processing function is a Parameter: : processParameters (byte [], int, int), the following will focus on analysis of the method.

Parameter: : processParameters (byte [], int, int, String) this function has four parameters, the first Parameter type is a byte [], is handleQueryParameter () function, obtain a copy of queryMB, Then pass it to processParameters(MessageBytes,String) and then to processParameters(byte[],int,int,String).

 

 // -------------------- Processing --------------------
    /** Process the query string into parameters
     */
    public void handleQueryParameters() {
       ...
        try {
            decodedQuery.duplicate( queryMB );
        } catch (IOException e) {
            // Can't happen, as decodedQuery can't overflow
            e.printStackTrace();
        }
        processParameters( decodedQuery, queryStringEncoding );
    }
Copy the code

The second and third arguments are both of type int and are the start position of the queryString and the length of the queryString as follows:

public void processParameters( MessageBytes data, String encoding ) {
        ...
        ByteChunk bc=data.getByteChunk();
        processParameters( bc.getBytes(), bc.getOffset(),
                           bc.getLength(), getCharset(encoding));
    }
Copy the code

The fourth argument is String, which means which method is used for decoding. If not, the default encoding is used for decoding. This will be explained step by step in a later blog post: Parsing Server.xml for Tomcat source code analysis).

Parse queryMB step by step, add each parsed parameter to a HashMap<String, ArrayList<String>>, and find the desired value based on name in the HashMap.

Parameters: : handleQueryParameters () function first get queryMB a copy, so that you can avoid the queryMB direct manipulation, destroy the original information, And then passed on to the Parameters: : processParameters (DecodedQuery, String), Finally by the Parameter: : processParameters (byte, int, int) processing, the first Parameter to this method is queryMB a copy, function to parse the copy is the basic function of after the decoding of each Parameter, Add to a HashMap<String, ArrayList<String>>, such as paramHashValues.

  // -------------------- Parameter parsing --------------------
    // we are called from a single thread - we can do it the hard way
    // if needed
    ByteChunk tmpName=new ByteChunk(); 
    ByteChunk tmpValue=new ByteChunk();
    private final ByteChunk origName=new ByteChunk();
    private final ByteChunk origValue=new ByteChunk();
    CharChunk tmpNameC=new CharChunk(1024);
    public static final String DEFAULT_ENCODING = "ISO-8859-1";
    private static final Charset DEFAULT_CHARSET =
    Charset.forName(DEFAULT_ENCODING);
Copy the code

Remember the default encoding problem? You guessed it, this is where the default encoding is defined

public static final String DEFAULT_ENCODING = "ISO-8859-1";
Copy the code

The basic idea is: Iterate through the byte array to get the name and value, and then call urlDecoder to decode the name and value. Add to Parameter::HashMap<String, ArrayList< String >> by calling addParameter(name,value).

QueryString Parameter parsing algorithm description

Pos: start position end: end position while(pos < end) nameStart: initialize to pos, parameter nameStart position nameEnd: ValueStart: Initialize to -1, the end position of the parameter name. NameStart and nameEnd can obtain the parameter name. ValueStart: Initialize to -1, the start position of the parameter value valueEnd: ParameterComplete: parsingName: a Boolean type initialized to true to indicate whether the name is being parsed. Boolean type, initialized to false, used to indicate whether a parameter has been parsed do swtich(the byte corresponding to the current position pos) '=': NameEnd = pos, parsingName = false, pos++, valueStart = pos; Otherwise pos++; '&': Whether parameter names are being resolved, if so, nameEnd=pos, otherwise valueEnd =pos, pos++; ParameterComplete =ture pos++; default: pos++; While (parameter not parsed and pos < end) if(pos == end) if(nameEnd == -1) nameEnd = pos; If (valueStart >-1 && valueEnd == -1) valueEnd = pos; end whileCopy the code

Algorithm for review

The essence of the above algorithm lies in four positional indicator and two Boolean variables. After completing a parameter resolution, the value of parameter.name is obtained by nameStart and nameEnd. ValueStart, valueEnd gets the value of parameter.value, parameterComplete is used to indicate whether a parameter has been parsed, and parsingName is used to indicate whether a name is being parsed. In some cases, parameter.value may be null, such as name=&passwd=123.

Algorithm source code

int pos = start; int end = start + len; while(pos < end) { int nameStart = pos; int nameEnd = -1; int valueStart = -1; int valueEnd = -1; boolean parsingName = true; boolean decodeName = false; boolean decodeValue = false; boolean parameterComplete = false; do { switch(bytes[pos]) { case '=': if (parsingName) { // Name finished. Value starts from next character nameEnd = pos; parsingName = false; valueStart = ++pos; } else { // Equals character in value pos++; } break; case '&': if (parsingName) { // Name finished. No value. nameEnd = pos; } else { // Value finished valueEnd = pos; } parameterComplete = true; pos++; break; case '%': case '+': // Decoding required if (parsingName) { decodeName = true; } else { decodeValue = true; } pos ++; break; default: pos ++; break; } } while (! parameterComplete && pos < end); if (pos == end) { if (nameEnd == -1) { nameEnd = pos; } else if (valueStart > -1 && valueEnd == -1){ valueEnd = pos; }}... }Copy the code

The above code is traversed once to get four indicators nameStart, nameEnd, valueStart and valueEnd, so that name and value can be obtained. After parameter.name and parameter.value are obtained, urldecode is performed on them. After decode is complete, addParameter(name, value) is called to add them to the HashMap.

            tmpName.setBytes(bytes, nameStart, nameEnd - nameStart);
            if (valueStart >= 0) {
                tmpValue.setBytes(bytes, valueStart, valueEnd - valueStart);
            } else {
                tmpValue.setBytes(bytes, 0, 0);
            }

            try {
                String name;
                String value;

                if (decodeName) {
                    urlDecode(tmpName);
                }
                tmpName.setCharset(charset);
                name = tmpName.toString();

                if (valueStart >= 0) {
                    if (decodeValue) {
                        urlDecode(tmpValue);
                    }
                    tmpValue.setCharset(charset);
                    value = tmpValue.toString();
                } else {
                    value = "";
                }

                try {
                    addParameter(name, value);
                } catch (IllegalStateException ise) { // Hitting limit stops processing further params but does
                    ...
                }
            } catch (IOException e) {
               ...
            }

            tmpName.recycle();
            tmpValue.recycle();
Copy the code

The above code is to urldecode tmpName and tmpValue, and then decode the information addParameter.

A few notes on decode:

After the name and value are obtained, UDecoder is called to decode them. If URIEncoding is not defined in Tomcat’s server.xml, the default “ISO-8859-1” is used to decode them.

Futuer work

In the process of this source analysis, there are some unresolved problems, will be in the later analysis of the process, step by step to solve.

List of questions:

1. When does Tomcat load the server. XML configuration file and obtain the URIEncoding value?

2. How does digest parse XML files?

3. How the underlying Coyote is implemented;

.

In the next article will analyze this post www.oschina.net/question/85… The cause of the problem

If you are interested in algorithms or programming, please scan the qr code below and follow the public account “Beauty of Algorithms and programming” to explore the mystery of algorithms and programming with you, and give you different solutions and analysis ideas.