4- Analytical – Theoretical analysis
In fact, we know exactly what “parsing” does, in short: “convert source code to machine code”. Aren’t you curious what’s involved in the process of transformation?
First, we show the above sentence – “source code into machine code”.
In a browser, most parsing is a document, and the whole process is an effort to translate the document into a structure that code can understand and use. The output is a tree of nodes that correspond to the structure of the document, often referred to as a parse tree or syntax tree.
For example, parsing 2 + 3-1 produces a syntax tree like this:
The whole parsing process is processed based on the grammar rules of the document, and each parsable format has a definite grammar, which is composed of consecutive words and grammar rules. Such grammar is called context free grammer, while human languages do not belong to this category (so they cannot use conventional parsing techniques). So the corresponding parsing is still a fairly rigid process, just matching the syntax and then transforming it.
How many stages can the parsing process be divided into?
The parsing process is divided into two stages, lexical analysis and syntax analysis.
Lexical Analysis stage: The whole process is to transform the input documents into valid tokens. Tokens are words in human language.
Syntax analysis stage: this stage is to apply syntax rules to the products of the previous stage and produce syntax tree.
The above process is based on the corresponding functions of the carrier are a Lexer and a parse respectively.
Lexical analyzer: Converts input documents into valid tokens that identify and eliminate invalid characters.
Parser: Converts valid tags into parse trees.
The entire parse process is circular. Parse continually asks for a token from the Lexer, and then attempts to find a grammar rule to hit. If it hits, the node corresponding to the token is added to the Parse tree. If not hit, first save to the internal; Then continue to ask for tokens to hit the syntax until all the tokens are hit with the syntax rule. If not, an exception is raised, which indicates that there is an exception in the document, because there is a syntax error.
Parsing into a syntax tree does not complete the process of converting the source code into machine code, leaving translation as the last step. The translation process is also the analytical process, which can be understood as the following flow chart:
To elaborate on the initial example: the 2 + 3-1 expression. This expression contains integers, plus signs, and minus signs.
The corresponding mathematical calculation syntax is as follows:
- The grammatical units that make up a language are expressions, items, and operators.
- The language we use can contain any number of expressions.
- An expression is defined as an “item” followed by an “operator” followed by an “item”.
- The operators are plus or minus.
- An item is an integer or an expression.
So parse the expression: the first string that matches the syntax rule is 2, which, according to syntax rule 5, is an item. The second substring that matches the syntax rule is 2 + 3, and according to rule 3 (an item followed by an operator, and then another item), this is an expression. The next match has reached the end of the input. 2 + 3-1 is an expression, because we already know that 2 + 3 is a term, which follows the “term followed by operator, followed by term” rule. 2 + + does not match any rule and is therefore an invalid input.
In the computer, the corresponding words are represented by regulars, so the words of the above expressions are expressed by regulars:
INTEGER: 0|[1-9][0-9]*
PLUS: +
MINUS: -
Copy the code
Since most parsed grammars conform to context Free Grammer, the BNF (Bacos normal form) rules are used. BNF is a set of symbols first introduced by John Backus and Peter Naur to describe the syntax of computer languages. The grammar rules are
< symbol > ::= < expression using symbols >Copy the code
The above examples can be defined as:
expression := term operation term
operation := PLUS | MINUS
term := INTEGER | expression
Copy the code
This is just the whole theoretical process of parsing, and its carrier is the parser, so what kind of parser?
Next article
Parsers – Parsers