The introduction
In the course of writing my article on the Basics of Regular Expressions, which introduced me to the basics of regular expressions, a question occurred to me: How exactly does a regular expression match a given string? Of course, just thinking of its matching mechanism as “magic” doesn’t work at all. After a bit of searching and literature reading, I’ve come up with some ideas, and I hope this article will give you some inspiration if you’ve ever thought about similar questions.
In this paper, Regular Expressions, finite automata, nondeterministic finite state automata, deterministic finite state automata, And Thompson’s constructional method are studied And based on this algorithm, I have implemented a regular expression parser. I use Python as the implementation language (because Python is simple enough to understand). But the idea behind the algorithm is applicable to any language.
Regular Expressions
Regular expressions are used to describe collections of strings, providing pattern/rule-based matching in addition to ordinary search capabilities.
- Except for the character *? | + (), all characters of its own matching for special characters, you need to use the escape character \ escaped
- If e1 match a, e2 matching b, then e1e2 match ab, e1 | e2 to match a or b
- Character *? + is the quantity modifier, e* means match 0 or more e, e+ means match 1 or more E, e? Matches 0 or 1 e
- Operator precedence for: select (e1 | e2) number (e1e2) < < combinations modifier (*? +)
That’s a brief overview of regular expression syntax. Modern regular expressions add many features, but to simplify our discussion, we’ll skip the list. In fact, to focus on the idea of the algorithm itself rather than the complex implementation of regular expressions, Also in this document are only for the above grammar achieve our regular expression parser (of course, is also in order to simplify the implementation, it is conceivable that the realization of the complete is very complex, and this article is to make readers preliminary understanding the principle behind the regular expression, if without limit, the complexity of a departure from the original intention of himself).
Finite Automata
Finite automata are another way of describing a set of strings. Finite automata consist of a finite set of internal states and a set of control rules that determine the next state from the current state and the next input signal. Among them, nondeterministic finite-state automata (NFA) and deterministic finite-state automata (DFA) are both subsets of finite automata, and later we will find that DFA is also a subset of NFA.
Mathematical description
DFA can be expressed as a quintuple:
Among them:
- III is for all input characters, which are limited
- SSS is the state set of NFA, and each element is called state
- FFF is a transformation function, which defines a single-valued mapping from S∗I→SS * I \rightarrow SS∗I→S, that is, F (P,a)=qf(p,a)=qf(p,a)=q, that is, when the current state is PPP and the input signal is AAA, it will be transformed to the next state QQQ
- QQQ is the termination state (acceptance state), which is a subset of S
- S0s_0s0 is the initial state
The mathematical definition of NFA is almost the same as that of DFA. The difference between NFA and DFA is that the transformation function FFF is the mapping of S*I \rightarrow P (S)\, where P (S)p(S) is the power set of SSS (i.e., the set of all subsets of SSS), F (p, a) = {q1, q2,… qk}f(p,a)=\{q_1,q_2,… q_k\}f(p,a)={q1,q2,… Qk}, if the current state is PPP and the input signal is AAA, the state converted to is a set of states.
To put it simply, the difference between NFA and DFA is that after a character is entered, the state after NFA transformation may not be unique and needs to be further selected, while DFA is unique, as the following example also illustrates.
example
For the regular expression A (bb)+a, the finite automaton can be expressed as:
Note that s0s_0s0 represents the start state, s4s_4S4 represents the end state, the end state is represented by a circle, and the remaining states are represented by a circle.
At this point, only one state corresponds to each state of the automata, so the automata is DFA and deterministic.
When DFA matches the input string, read one character at a time, and change the automaton from one state to another state according to the arrow. Assume that the input string is ABbbba and the initial state is S0S_0s0. After reading character A, the state changes to S1S_1S1 and read character B to s2S_2S2. Read B becomes S3S_3S3, read B becomes S2s_2S2, read B becomes S3S_3S3, and read A becomes S4s_4S4. Since S4S_4S4 is in the terminated state, the regular expression successfully matches the string, and the reading process is as follows:
When using an automata to match an input string, there are three cases:
- After the string is read completely, the automaton stops in the terminated state and the match succeeds.
- After the string is read completely, the automaton stops in the non-terminating state and the matching fails.
- In the process of reading the string, the automaton has no pointer to the next state, but there are still characters unread, and the match fails.
We can also convert the DFA into an NFA:
At state S3s_3S3, the automaton has two choices: go back to s1s_1S1 or continue to S4s_4S4, where state S3S_3S3 corresponds to state set {s1, S4}\{s_1, s_4\}{s1, S4}, so the automaton is nondeterministic.
Thompson’s Construction method
Since regular expressions and finite automaton can mean the collection of the string, should mutual transformation between them, in fact, did a lot of the regular expression into the algorithm of finite automaton, the next I will introduce the method to construct a – Thompson, this method can convert a regular expression to the corresponding non-deterministic finite state automata (in fact, Regular expressions can also be transformed into corresponding deterministic finite state automata.
The NFA representation of a regular expression is combined by the NFA construction of part of the regular expression. The difference is that the NFA corresponding to part of the regular expression has no termination state. In fact, it contains one or more Pointers that have not yet pointed to any state. The system will connect the unconnected Pointers in the NFA corresponding to some regular expressions according to the corresponding logic of the operators. After the construction, the system will connect all the unconnected Pointers to the terminated state to obtain the NFA model corresponding to the regular expression.
For the syntax of the regular expression described above, the corresponding NFA representation is shown below.
A. Single-character a is:
B. The connection e1e2 of the regular expression is expressed as:
C. the selection of the regular expression e1 | e2 is expressed as:
Note that when said e1 | e2, s need to create a new state, this is because according to the definition of the NFA, all the NFA must have an initial state.
D. Quantity modifiers for regular expressions? Represented by:
E. The quantity modifier * of the regular expression is:
F. The quantity modifier + of the regular expression is expressed as:
By observing the above representation, it can be seen that the maximum number of states of NFA is the length of the regular expression.
Regular Expression Search algorithm (Multi-State Simulation)
When using a regular expression to search for an input string, we first convert the regular expression to the corresponding NFA, and then use that NFA to match the input string. During the matching process, we have two strategies.
Since NFA is non-deterministic, that is, it is possible to encounter branching cases, in which the system needs to decide which branch to go to, an easy way to think about is backtracking. The basic idea of backtracking is to go to the first branch first, and if the result is wrong, go back and try another branch. In the worst case, there may be two branches in each step, so the time complexity of backtracking method is O(n−>2n)O(n->2^n)O(n−>2n), where NNN is the number of states in NFA. With the increase of the complexity of regular expression, there is no doubt that the matching speed will decrease greatly.
Another efficient strategy is to test multiple branches at the same time and record the states that meet the requirements and store them in the list. After traversing the input string, all the states in the list are traversed. If there is any termination state, the NFA is considered to match the input string. In this multi-state simulation strategy, every state in NFA is traversed at most in each step of matching, so the time complexity is O(N −> N)O(N -> N)O(N −> N). Compared with backtracking algorithm, this algorithm greatly improves the time efficiency, which is also the algorithm to be implemented in this paper.
implementation
I will implement the above multi-state simulation strategy in Python. For the full source code, see Python Regular Expressions Implementations. Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby…) The realization of the idea.
1. Convert the regular expression to NFA
According to the Thompson construction method above, the states in NFA are abstracted as:
class State(object) :
def __init__(self, id, c) - >None:
Id: indicates the unique ID of state. C: indicates the character of state. When c<256, it indicates the character to be matched, when c=256, it indicates the SPLIT state, and when c=257, it indicates the MATCH state. Only used when c is SPLIT, where out and out1 are used to join two different choices (states) lastList: The list of states used to identify where the state is, used at run time to prevent repeated addition of state, is in fact an NFA(fragment), just as a tree node is a tree, except that when C <=256, out/out1 does not connect to the MATCH state, but instead has one or more disconnected exits. To None "'
self.id = id
self.c = c
self.out = None
self.out1 = None
self.lastlist = 0
Copy the code
According to the character C in State, State can be divided into the following three types:
According to Thompson’s paper, before converting regular expressions to NFA, you need to convert regular expressions to postfix expressions and use operators. Said connection operator (i.e. e1e2 = > e1e2.), class method under RexExp re2post completed this step, such as regular expressions, a | c (b) d the postfix expression is: ABC |. D..
When parsing this postfix expression, we will maintain a list of NFA fragments corresponding to partial regular expressions, where each fragment is represented as:
class Frag(object) :
def __init__(self, start, out=None) - >None:
Start: indicates the start state of the NFA fragment out: Represents all unconnected states in the NFA fragment. Since Python is passed by value, pass tuples (out, outtype), Outtype (s, 'out') for s.out (s, 'out') for s.out (s, 'out1') for s.out1 at the end of NFA generation, these unconnected connections should be made to the MATCH State Frag class to simplify NFA generation. You can run much faster by caching all out of the NFA fragment, rather than walking through the state list each time to get unconnected states.
# start state
self.start = start
# * list of out state item: (out, outtype)
# * eg: (s, 'out') -> s.out
# * (s, 'out1') -> s.out1
self.out = out
Copy the code
It has been explained in the Section of Thompson construction method that the NFA corresponding to some regular expressions has no termination state. In fact, it contains one or more Pointers that have not yet pointed to any state. Therefore, start here is the start state of NFA fragment, and out is the pointer of all unconnected states. It may be s.out or s.ut1. During construction, the system automatically determines the logic of the operator or its termination state so that all out Pointers to the NFA fragment point correctly to the corresponding value.
Some helper functions:
def singletonlist(self, item) :
Item: (state, outType) where outtype takes 'out' and 'out1' respectively to indicate the out pointer and out1 pointer of state.
return [item]
def patch(self, out, state) :
Out: all unconnected Pointers to the NFA fragment state: some state s This function is used to connect all exits of the segment to state.
for item, out_type in out:
if out_type == 'out':
item.out = state
else:
item.out1 = state
def append(self, out1, out2) :
Out1: all unjoined Pointers to NFA fragment 1 Out2: all unjoined Pointers to NFA fragment 2 This function is used to merge unjoined Pointers to NFA fragment 1 and unjoined Pointers to NFA fragment 2.
return out1+out2
Copy the code
Singletonlist is used to generate a pointer list containing only item, Patch is used to connect all unconnected pointer list OUT of NFA fragment to state, and Append is used to merge unconnected Pointers of two NFA fragments.
With the above definition, we just need to follow the steps of Thompson’s construction method to construct NFA exactly, which is implemented by the function post2nFA.
A. For the join operator. :
if item == '. ': #! Catenate, this parser will. As an explicit connector
e2 = frags.pop()
e1 = frags.pop()
self.patch(e1.out, e2.start)
frags.append(RegExp.Frag(e1.start, e2.out))
Copy the code
B. to select operator | :
e2 = frags.pop()
e1 = frags.pop()
s = RegExp.State(self.get_state_num(), RegExp.SPLIT)
s.out = e1.start
s.out1 = e2.start
frags.append(RegExp.Frag(s, self.append(e1.out, e2.out)))
Copy the code
C. For quantity modifiers? :
e = frags.pop()
s = RegExp.State(self.get_state_num(), RegExp.SPLIT)
s.out = e.start
frags.append(RegExp.Frag(
s, self.append(e.out, self.singletonlist((s, 'out1')))))
Copy the code
D. For the quantity modifier * :
e = frags.pop()
s = RegExp.State(self.get_state_num(), RegExp.SPLIT)
s.out = e.start
self.patch(e.out, s)
frags.append(RegExp.Frag(s, self.singletonlist((s, 'out1'))))
Copy the code
E. For the quantity modifier + :
e = frags.pop()
s = RegExp.State(self.get_state_num(), RegExp.SPLIT)
s.out = e.start
self.patch(e.out, s)
frags.append(RegExp.Frag(
e.start, self.singletonlist((s, 'out1'))))
Copy the code
F. For single characters:
s = RegExp.State(self.get_state_num(), item)
frags.append(RegExp.Frag(s, self.singletonlist((s, 'out'))))
Copy the code
After parsing the postfix expression according to these steps, all Pointers that are still unconnected will be joined to the termination state (MATCH), so as to obtain the NFA model corresponding to the regular expression.
2. Use the regular expression search algorithm to match the input string
When using NFA to match the input string, we construct two lists: cSTATES, the current successful match status list, and NStates, the successful match status list after reading the next character. Where cSTATES is initialized to the input NFA.
def match(self, start, input_str) :
Start: NFA input_str: string to be matched This function matches the input string with the successfully constructed NFA and returns True on success.
cstates = [] #! current states
nstates = [] #! next states
Build the initial state set from NFA
cstates = self.start_states(cstates, start)
for c in input_str:
self.step(cstates, c, nstates)
cstates = nstates
nstates = []
return self.is_match(cstates)
Copy the code
As each character of the input string is read, the system executes the step method, which loops through the CStates to get a list of all the states in the CStates that match the current character, and adds the state pointed to by the out pointer of each state to the NStates. After executing step, assign cStates to nStates and null nStates.
def step(self, cstates, c, nstates) :
Cstates: all NFA fragments successfully matched so far C: character to be matched nSTATES: Next state list, used to store all NFA fragments successfully matching C This function is used to match character C in all NFA fragments successfully matching so far. If a match is successful, the next state of the NFA fragment is added to the NStates list for matching the next character "".
self.list_id += 1
for state in cstates:
if state.c == c:
self.add_state(nstates, state.out)
Copy the code
The add_state method is used to add a state to the states list. This is also the function of the lastlist attribute in the state class. Each time step is executed, the system creates a new state list and incrementing listid by 1. By judging the lastList attribute of a state, we can determine whether the state is added to the States list. Meanwhile, according to the regular expression search algorithm described above, we need to determine whether the state is a branch state. If it is a branch state, It is considered that the add_state method should be executed on each of its branch states.
def add_state(self, states, state) :
States: list of NFA fragments State: State to be added/NFA This function is used to add state to states. Note 1. Empty/existing states cannot be added repeatedly 2. SPLIT states should be added with two SPLIT states.
# state is empty or state is already in states
if state is None or state.lastlist == self.list_id:
return
state.lastlist = self.list_id
if state.c == RegExp.SPLIT:
Add two branch states
self.add_state(states, state.out)
self.add_state(states, state.out1)
return
states.append(state)
Copy the code
Finally, the is_match method is used to determine whether there is a stop state in the list of the last matched states after traversing the input string. If there is a stop state, it is considered as a successful match, otherwise it fails.
def is_match(self, states) :
This function is used to check whether the remaining valid NFA fragments are in the MATCH state after iterating through the input string. If an NFA fragment is in the MATCH state, it is successfully matched.
for state in states:
if state.c == RegExp.MATCH:
return True
return False
Copy the code
Convert NFA to DFA
Deterministic finite state automata (DFA) can also uniquely represent a regular expression, and it is often superior to NFA in efficiency because there is no branching case.
In fact, any NFA can be transformed into a corresponding DFA representation, where each state of the DFA is represented as a list of states of the NFA.
For example, for regular expressions abab | abbb, the NFA is expressed as:
When converted to the corresponding DFA, it can be expressed as:
In a sense, the search algorithm described above is equivalent to the implementation of DFA, in which CSTATES is equivalent to a state in DFA, and a new DFA state meeting the conditions is obtained every time the step function is executed.
Thompson’s basic idea of DFA simulation is as follows: After the STEP function is executed and the NStates are obtained, some data structure is used to cache them, so that it can be calculated on demand and repeated calculation can be avoided. In accordance with this idea, I will implement the search algorithm using DFA model to write regular expressions based on the NFA mentioned above.
First, a new data type, DState, is introduced to represent THE DFA state:
class DState(object) :
def __init__(self, states) - >None:
States: A state in the DFA corresponds to a set of states (fragments) in the NFA. Save all Pointers to the next state, size 256 (extended ASCII). If the current state is D and the input character is C, then d.ext [c] is the next state. If d.ext [c] is None, it is not connected yet. Tree right branch ""
self.states = states
self.next = [None for _ in range(256)]
self.right = None
self.left = None
Copy the code
DState is used to cache cStates in the NFA search algorithm, where the next list stores Pointers to all possible states: This regular expression parser matches only extended ASCII codes, so the length of the next list is 256. If the current state is D and the input character is C, then d.ext [c] is the next state after the input character c. If d.ext [c] is None, then the next state after the input character C is d. ext[c]. Call next_state to calculate the next state and cache it to d.ext [c]. Left and right Pointers are used to construct a binary search tree.
When DFA is used to match the input string, the system will traverse the characters in the input string to determine whether the current state d.ext [c] is None. If it is None, the next_state method will be called to generate the DFA state. The implementation of the match method is as follows:
def match(self, start, input_str) :
Start: DFA input_str: input string This function is used to match the input string "" with DFA.
for c in input_str:
index = ord(c)
if start.next[index]:
next_state = start.next[index]
else:
next_state = self.next_dstate(start, c)
start = next_state
return self.is_match(start.states)
Copy the code
The system takes the sorted state list in NFA as the key, constructs a binary search tree, and obtains the corresponding DState by passing the NFA state list.
def get_dstate(self, states) :
This function takes the set of states as the key and gets the corresponding DState of the set.
# sort
states.sort(key=lambda state: id(state))
# loop the current Dstate
dstate = self.dstate
while dstate:
cmp = self.states_cmp(states, dstate.states)
if cmp < 0:
dstate = dstate.left
elif cmp > 0:
dstate = dstate.right
else:
return dstate
# new dstate
dstate = RegExp.DState(states)
return dstate
Copy the code
The next_state method obtains the current DState and input characters, obtains the next state list by executing step function, and then obtains the corresponding DState by get_state method, and then caches it to d.ext [c].
def next_dstate(self, dstate, c) :
Dstate: DFA fragment c: input character This function retrieves the next state of the DFA (i.e., the state set of the NFA) based on the current DFA and the input character.
nstates = []
self.step(dstate.states, c, nstates)
dstate.next[ord(c)] = self.get_dstate(nstates)
return dstate.next[ord(c)]
Copy the code
The start_dstate method takes NFA to initialize DState.
def start_dstate(self, start) :
Start: state(also known as NFA) This function gets the initial DFA based on NFA
states = []
return self.get_dstate(self.start_states(states, start))
Copy the code
Note that the state in this DFA is the cache of the result of step function. Therefore, in actual use, memory management is often added to prevent memory overflow, but this article is not introduced due to space limitations.
reference
- Regular Expression Matching Can Be Simple And Fast(but is slow in Java, Perl, PHP, Python, Ruby, …)
- Writing own regular expression parser
- Regular Expression Search Algorithm