Basic method

Start with the sentence ω, scan ω from left to right, repeatedly replace the left part of the production with the right part of the production, search for the match of ω, and finally get the beginning symbol of the grammar (or, find the error) (that is, make a tree from the bottom up, and finally push to the root of the beginning symbol)

At each step of the analysis process, we always try to find a substring in the sentence pattern that can be replaced by the left part of the production. So you go up and up, and you end up with an opening symbol.

And since we scan the token stream from left to right, our process of [trying to find substrings in sentence patterns that can be replaced by the left part of the production formula, and making constant substitutions] also goes from left to right. In this way, the inverse process of bottom-up analysis from the sign to the beginning sign is actually a most right derivation!

The reverse of the derivation is called the protocol. The inverse of the right-most derivation is the left-most specification.

Phrase, handle, specification, cut handle

The rightmost derivation is called canonical derivation, so as the leftmost specification of its reverse process, it is the canonical specification

Phrases, direct phrases, and handles

If S =*> αAδ, A =+> β, then β is said to be the pattern αβδ relative to A. If A→β, then β is said to be the pattern αβδ relative to the production of formula A→β. The leftmost direct phrase of a sentence pattern is called a handle.

A sentence pattern is a complete structure, while a phrase is a part of a sentence pattern targeted at a non-terminal. Therefore, the opening symbol S is a sentence pattern rather than a phrase.

Elements of phrase formation:

  • A can be derived from S, that is, S=*> αAδ;
  • At least one derivation from A leads to β, that is, A=+> β.

In my opinion (I think), one of the key points of specification is trying to figure out “what are the chaining characters that count as a symbol in a stream of symbols that are essentially strings? How many symbols do they match the right hand side of any production?” The problem. And the notion of phrase, direct phrase, handle, is really a scheme to describe the combinations and boundaries on the right side of the production — think of the phrase in natural language, A phrase in natural language is an ordered sequence of words made up of at least two words that are combined in order to express a common meaning — only a particular word in a particular sequence can be called a phrase. The same is true of phrases in grammatical analysis: a phrase in grammatical analysis is a sequence of symbols composed of up to a few terminals or (and) non-terminals, which is the right part of a production and can be reduced to the left part of a production when specified.

Reflected in the example defined above is:

  • β is the sentence structure αβδ relative to A, then αβδ can be reduced to αAδ by at least one step. αAδ can be reduced to the beginning symbol S after zero or more steps.
  • β is the [direct phrase] of the sentence structure αβδ relative to the formula A→β, then αβδ can become αAδ after A step of specification;

A phrase, not necessarily a sequence containing a terminal, can also be a single non-terminal. Handles change throughout the specification: after all, A phrase is, by definition, just A sentence pattern derived from at least one derivation of A non-terminal A from the opening symbol S. So in the following example, it must be possible to derive the non-terminal T2 from the starting symbol S, and T2 can also be expanded to F1 after a step of derivation. So, F1 is the T2 phrase (and the direct phrase, and the leftmost direct phrase, so F1 here is the handle to T2)

  • Phrase: all leaves from left to right in a subtree rooted in a non-terminal
  • Direct phrase: all the leaves from left to right in a parent-child tree (tree height 2)
  • Handles: All the left-to-right leaves in the leftmost parent-child tree (handles are unique)

Specification specification (leftmost specification)

If α is a sentence of grammar G and satisfies the following conditions, the sequence αn, αn-1… , α0 is a leftmost specification:

  1. Alpha n = alpha
  2. Alpha 0 = S (the beginning of G)
  3. For any I (0 < I <= n), αi-1 is obtained by replacing the handle in α I with a non-terminal character in the left part of the corresponding production.

Ex. :

Grammar: (1)S→aABe (2)A→b (3)A→Abc (4) b →d

abbcde <= aAbcde <= aAde <= aABe <= S

The procedure above, seen from left to right, is the left-most specification (called the specification specification), and from right to left is the right-most derivation

In fact, the above process is very brief, and there are some key details that need to be gradually understood later. For example, in steps 2 to 3, why is aAbcde not specified as aAAcde? The solution to this problem is to use the FIRST, FOLLOW collections, and push-down stacks mentioned in our previous blog.

Move into – protocol

Specification is the process of replacing a substring identical to the right of a sequence with a non-terminal character on the left of a production. Because the input sequence is scanned from left to right during the protocol, the most leftmost direct phrase (that is, the handle) in the sequence is searched each time during the protocol.

In the process of specification, what the parser needs to do is to scan the characters in the input symbol stream and treat them accordingly according to the condition of the scanned characters: if the specification can be carried out, other operations are carried out if the specification cannot be carried out.

Because obviously, specifications can’t be treated each character can, from left to right in the process of the actual code will appear temporarily unable to code (or wrong), such as the example above, the first row in the input sequence of a until the final step in the statute and sentence the other symbols are reduced to E together. This shows that in the process of specification, there will be a need to “put aside some symbols until the time is ripe for processing”.

That is, our parser needs to be able to write down characters in order first. During the later specification process, the symbols that cannot be specified for the time being are monitored by the parser, which needs to be able to decide whether to add the previously stored symbols to the specification. Character storage is still done using a stack — a stack is used to store [prefixes that will contain handles] and then, as input characters are read later, to determine whether the prefixes already placed in the stack can be specified.

The parser also relies on a table to determine whether it can be specified (when it can be specified, and which production to use). This table is called the move forward specification analysis table.

This is the method of moving into specification: use the stack to store the prefix of the input sequence that will contain the handle, use the analysis table to determine when the top of the stack has formed a handle, and determine the production to be used for specification.

Characters are continuously moved from the input sequence onto the stack until the handle is formed. When a handle is formed at the top of the stack, the string is specified as a non-terminal character.

The method of moving – specification is to continuously determine whether the prefixes of the scanned part of the input sequence can be successfully converted to the symbol sequence in the stack. Can the scanned input sequence be changed into a live prefix for a grammar (the word will be explained later).

Explanation: I think this parser analyzes grammar in the same way that we understand human speech: Every sentence we say contains at least one word, and some words are related to each other (that is, multiple words form a phrase to express a common meaning. At this point, these words form a phrase and need to be considered as a whole to understand). When we listen to other people speaking, our ears are like a read-write head reading symbols at every moment, following the output symbol stream of the speaker to read, and temporarily storing a word that the person has just said in his mind, ready to be associated with the new word heard next, and to understand the whole. We put half of what other people say into our brain and wait for the rest, that is, move in. We have listened to a paragraph of somebody else’s words, which can be understood as a whole and summarized into a meaning, which is the statute.

Move to the working mode of the analyzer

The model moved into the spec analyzer is very similar to the predictive analyzer, as shown in the figure below

The symbol state stack can form a pair of symbols and states-not only symbols, like the symbol stack, but also a state value associated with those symbols. Symbols and states are stored in pairs on the stack.

The moving-protocol analysis table tells us if we can do the protocol, and if so, by which generation. The types of this table are not unique. According to the analysis table, the moving protocol analyzer can be divided into operator priority analyzer, LR analyzer and other types. These parsers all belong to different implementations that move into the specification parsers.

Moving into a protocol analyzer works by converting from schema to schema step by step. The pattern here is also a triplet (# stack, current remaining input #, action to change the pattern). At each new schema, the parser looks up the table to determine the next action – whether to move to or to the specification, and if so, which generation to specify.

Grammatical analysis is to start from an initial pattern and go through a series of pattern changes, and finally reach the receiving pattern (success) or error pattern (discovery of grammatical errors).

  • Stack: store grammar symbols and states, and symbols and states appear in pairs;
  • Residual input: all sequences in the initial pattern; Residual input is empty when receiving pattern; Other patterns are suffixes of the original input sequence.
  • Action: The driver searches the analysis table to determine the corresponding action based on the current stack and remaining input, and then changes the stack and remaining input content to enter the next pattern based on these actions. Here are some game-changing actions:
    • Shift: To stack the terminator in the input sequence. This action occurs when no handle is formed at the top of the stack (that is, no protocol is available);
    • Reduce: When the handle is formed at the top of the stack, the handle at the top of the stack is replaced with the corresponding non-terminal character according to the production formula;
    • Accept: The parsing is successful;
    • Error: A syntax error is found and an error recovery program is called.

Example of moving specification analysis:

Abbcde was analyzed by the shift-specification method

If the sentence pattern is taken separately, it is a right-most derivation from bottom to top.

  1. In the leftmost protocol, the handle is always formed at the top of the stack: after a protocol, the parser must move 0 or more symbols to find the next handle at the top of the stack.
  2. The stack always contains a right sentence prefix that does not contain the symbol following the handle. This prefix is called a living prefix (a leftmost derivation is called if the leftmost non-terminal of the sentence pattern is replaced in each direct derivation. The sentence pattern from the leftmost derivation is called the left sentence pattern. (2).

Moving into protocol analysis, two key issues need to be addressed:

  1. How to determine whether a handle is formed at the top of the stack;
  2. How to select a production when forming a handle.

To put it bluntly, it is to resolve the question of when and with what statute

Construct the DFA for identifying live prefixes