Source: gist.github.com/liangchenhe… ⚠️ source code according to the use of scenarios customized, in order to maximize the efficiency of the use of abstract structure, such as the need to use please optimize.
demand
Due to project requirements, which require frequent parsing and computation of mathematical expressions on low-performance devices, heavyweight sets such as lexical analysis, syntax analysis, and abstract syntax tree 🌲 are not appropriate. (Of course, not bad, but a little overqualified, and personal ability is limited, not good at ANTLR tuning and so on)
Speaking of mathematical formula analysis, of course, can not leave the old friend inverse Polish expression, also known as postfix expression, he can ignore parentheses and other priorities, one by one. We’re going to start from there and gradually add other features.
First, summarize the requirements:
- Supports addition, subtraction, multiplication and division with parentheses, and parentheses nesting.
- Support for negative/negative expressions and continuous negative signs. Such as
1 * 2 = 2
.1 * - 2 = 2
.1 * - (2 + 3) = 5
. - Support for fixed functions and nesting of functions and expressions. Such as
1+abs(-2-3)=6
. - Constants are supported, for example
pi
.e
.
Because scenarios are used frequently, do not use temporary objects during parsing, and if you do use them, create a pool of objects to avoid excessive GC pressure.
Basic implementation
The first is the most basic, infix to suffix expression. To recap the basic idea:
Prepare an OP stack store operator. Scan the string from scratch and print the number (let’s call it output), otherwise assume the top OP element is A and the scanned identifier is B:
- If B is the close parenthesis, the stack elements are output out of the stack up to the open parenthesis. Parentheses out of the stack do not output.
- If A is empty or A is an open parenthesis or B is an open parenthesis, B is pushed.
- If the priority of B is higher than that of A, B is pushed.
- Otherwise, A outputs the stack and loops until one of the above conditions is met.
After the scan, all OP elements are output out of the stack to obtain a complete suffix expression.
Using the above algorithm, the expression 2*(3+4) is converted to output 234+*. Then you just scan the output sequentially, hit the number, hit the operator off the stack operand, and push the result back on the stack, until the only element on the stack is the answer.
It’s not hard to see that this is the equivalent of walking twice. In fact, we can prepare an expression exp stack and plug it directly into exp during the output process. If we encounter an operator, we can evaluate and insert the result into exp. So you can just walk through it and get the answer.
💡 Tips
If disk B is pushed, it can be found that the elements that may exist in the stack contain operators and open parentheses. Closing parentheses are never pushed. That is, the possible values of A are the operator and the open parenthesis. Counterintuitively, we just need to define the left parenthesis (defined as having the lowest priority) to treat the parenthesis as an operator and count it in the priority comparison, avoiding a small piece of special logic. Note, however, that when both AB and AB are left parentheses, B should be pushed, not A, even though the priority is the same. So rather than define the int priority directly, I prefer to define a two-dimensional bool array that can quickly find out whether any two symbols should be compared on the stack or on the stack.
So far we have solved the four operations and the parenthesis problem.
The extra demand
function
My basic idea is to treat the function as a special operator (like the plus sign). Then there are several problems:
- The operands of functions come after the operators.
- How to define the priority of a function.
- The delimiter of the argument
.
What to do. - When can functions be pushed off the stack?
The first question doesn’t really matter. So let’s think about the normal minus minus process. The left-hand operand is printed first, then the minus sign is pushed, and finally the right-hand operand is printed. Compare this to a function: the function (special operators) is first pushed, and the arguments are then printed (as operands). Obviously, the final output and stack are the same: the operands are printed from left to right, and the operators are pushed onto the stack. So the position of the operands relative to the operators is not important.
The second problem is simpler. Obviously, the function should have the highest priority, and it should be pushed directly, but how do the two functions compare? In fact, this cannot happen because the function must be followed by an open parenthesis (once the open parenthesis is pushed, subsequent operators must be pushed directly. Therefore, when a function is pushed onto the stack, the top of the stack cannot also be a function.
As for the parameter delimiter, I treat it as the end of an expression. Since each argument can be a nested expression, we know that the function must evaluate all arguments before execution. In this way, the parameter delimiter, like the close parenthesis, does not need to be pushed onto the stack, and other elements in the stack are ejected until the open parenthesis (. But do not push the open parenthesis off the stack, because the open parenthesis and the function are one and the same, and the delimiter is only the end of the argument, not the end of the function.
Finally, since the function has the highest priority, when does it get off the stack? Obviously you can’t wait for the final set to get out of the stack, otherwise it would be a Schrodinger parameter. In other words, when all the operands are printed out, the function should be pushed off the stack as soon as possible. The answer is clear: the function itself should exit the stack when the function terminates, and the close parenthesis) is the function terminator. This symbol has been defined in the base section, and when a close parenthesis is encountered, we loop out of the stack until we reach the left parenthesis. If the top element is a function after the left parenthesis is removed from the stack, it is removed and printed. This ensures that the function is adjacent to all its arguments in postfix expressions.
Minus sign
The minus sign can be thought of as a monocular operator. To distinguish it from the minus sign (-), the wavy line (~) refers to the minus sign. Then there are only two problems to solve:
- When does a minus need to be a minus sign?
- How about the priority of the minus sign?
To solve the easy ones first, the minus sign should have the highest priority (except for functions). Because the minus sign is always combined with the following number or expression, it is not affected by the preceding operator. Such as 1 * ~ 2 = 2.
To solve problem 1, I list all the possible data types in the entire expression, and then determine how the minus sign should be interpreted.
In front of the symbol | The semantic |
---|---|
Left parenthesis( |
Minus sign |
Other operators+-x/ 等 |
Minus sign |
digital | A minus sign |
Right parenthesis) |
Additional judgment |
Close parentheses and so on are a little bit more complicated. Normally the front is an expression, so the back should be a minus sign. But our syntax allows empty parentheses (), which are parsed to mean nothing. Exp stack = exp stack = exp stack = exp stack = exp stack = exp stack = exp stack = exp
So we need to use a variable to record the type of the last symbol, so that we can use it to determine semantics. The continuous negative case requires no additional processing.
constant
Constants are the simplest extra requirement, so simple that they don’t even require extra processing. When you read a constant, you just treat it as a number.
Analytical problem
Finally, let’s talk about parsing. Because a Token may consist of multiple characters, for example the number 12.3 has four characters and the function ABS has three characters. So essentially we have to write a custom lexical parser by hand, which is not too difficult because the syntax is relatively simple. Considering the usage scenario, the purpose here is to scan the entire string only once, no rollback, no preread. So let’s try between the outermost layers.
The first is the simplest constant and function parsing. Start with a letter and contain only alphanumeric characters and underscores (_). We can extract the following characteristics :(constant and function names are collectively called identifier ids for the sake of description)
- The first letter encountered must begin with an ID.
- The first non-alphanumeric underscore encountered thereafter must indicate the end of the ID.
Then the algorithm will follow. To improve the id cobbling efficiency, I’ve externally defined a StringBuilder for splicing. Of course, there are other schemes, such as recording the first and last indexes of the ID and extracting them all at once.
The next step is to identify numbers, including decimals. Numbers can contain numbers and decimal points. Similarly, a number or decimal point encountered. Represents the beginning of a number, and the first non-number and non-decimal point encountered after that indicates the end of the number.
We can use the id method above to record the numeric string, and finally call the Kotlin library function to convert it to a double. But there’s a caveat here: the library has to scan the string when it’s converted, and we’ve already scanned it, so we’re duplicating the work. To do this, I define a double variable num=0, and a double fraction=1 that represents the number of decimal places. When the number is scanned, add num *10 to the current number. If it is after the decimal point, fraction / 10 is first multiplied by the current number and then added to num.
In this way, we can convert strings to numbers while scanning.