Hello everyone, I am cold grass 😈, a grass system code ape 🐒. Intermittent hot blood 🔥, continuous sand carving 🌟. If you like my article, you can follow ➕ and like 👍, and grow up with me ~ wechat: hancao97
“This article has participated in the good article call order activity, click to see: back end, big front end double track submission, 20,000 yuan prize pool for you to challenge!”
prologue
Words to readers
A while ago, I published an article on the principles of Compilation (1) : Introduction to Compilation, which gave a rough introduction to the subject of compilation principles. I also thought of a question:
It’s probably not going to be a very good series of articles for this course, and there are a lot of people out there who are much better at it than I am at it.
So I made the decision: I’m a front-end, and most of the people reading my article are probably a front-end, so I might as well go ahead and combine some of the front-end applications or practices with compilation principles. So this article is a new practice in this series 🌟. If you like it, you can leave your likes 👍 or follow ➕. Your support is the biggest motivation for me to write 🔥.
In this article, I’m going to start with automata, extend it to the regular match we use, and finally I’m going to show you a simple regular match. If you have any questions or comments, please leave them in the comments section and I will read them carefully.
Gitignore: node_modules: : node_modules: : node_modules: : node_modules
The article Outlines
Foundation of Finite Automata: DFA and NFA
Analysis of regularity principle
Hand touch hand, take you to achieve a simple version of the regular engine
And article directory, if you don’t read the first chapter automaton science, also can jump straight to the second chapter began to read, but recommend full text reading, because the first chapter also dry, full of the first chapter is too blunt where I have an example or more ground in order to do the descriptions summarize ✨, for the full experience of thinking, Feel the process of my complete study and practice, belong to the regular that such as fireworks 🔥 gorgeous summer poetry will slowly spread out.
So let’s start with a little salt
Foundation of finite Automata
Chapter Introduction — Finite Automata (FA)
If you don’t understand the definition of features and forms, please go to the following part to summarize, to help you have a rough understanding of the definition of finite automata.
First of all, before we move on to the next topic, it’s important to know what a finite automaton is, which is also called a sequential machine.
Finite automata have the following characteristics:
-
The system has a finite number of states, and different states represent different meanings. According to the actual needs, the system can complete the specified tasks in different states.
-
We can assemble the characters that appear in the input string to form an alphabet. All strings that the system processes are strings on this alphabet.
-
In any state, the system reads a character from the input string and moves to the new state based on the current state and the character read in.
-
There’s a state in the system, which is the starting state of the system.
-
There are also states in the system that indicate that the string of characters it has read so far is a sentence of the language.
Formal definition of finite automata:
A finite-state automaton is a quintuple M=(Q, σ, δ, q0, F), where:
-
Q — a nonempty finite set of states. ∀ Q ∈Q, Q is called a state of M.
-
Sigma — Input the alphabet.
-
Delta — state transition function, sometimes called state transition function or shift function, delta: Q× σ →Q, δ(Q,a)=p.
-
Q0 — the start state of M, also called the initial state or the start state. Q0 ∈ Q.
-
F — the set of termination states of M. F is contained by Q. Given any Q ∈F, Q is called the termination state of M
Let me summarize:
Don’t get caught up in the list of definitions and features, but it’s pretty simple:
- First of all, finite state machines have a starting state. Here’s a less appropriate example: if you want to submit a work order, you go through the stages of approval by your direct supervisor, approval by your personnel, and approval by your boss. then
Submit the repair order
Is the starting state of the finite automaton of the work order system, corresponding to the quintupleq0
- Of course, in addition to the start state, we also have an end state, so for our process of submitting a work order, our end state is
The repair order success
orEnd of the work order
, you can find that this finite automaton has two end states, so the end states are a set, not necessarily only one end state, the end state is in the quintupleF
- And the word “finite” is important for our finite automata, which means it has a finite number of states, as we see in the picture below
Submit the repair order
.To be approved by superior leadership
.Subject to personnel approval
.Pending approval of the boss
.The repair order success
.The repair order failed
That’s our set of states, which is our quintupleQ
- So, as an automaton, how do we know what our next state is? So we actually need two things, one is the current state, and the other is the state transition function δ. For example, our current stage is personnel approval, and our input is actually personnel approval, through
State transition function (judgment of approval results)
And then we can get our next state, which could beThe repair order failed
orPending approval of the boss
.
I don’t know if you have a certain understanding of the definition after my example, in fact, it is very simple to sum up, in fact:
Finite automata = finite set of internal states + set of control rules
DFA – Determines finite automata
define
We have learned about finite automata in the previous paragraph, so what are the special points of finite automata? Let’s take a look at the definition of finite automata:
- Determine the finite automaton M as a quintuple M = (S, ∑, s0, f, Z)
- S: a finite set of states, each element of which is called a state;
- ∑: a finite alphabet in which each element is called an input character;
- S0 ∈S: Unique initial state (start state);
- F: State transition function :S×∑→ S, and a single valued function, f(Si,a)=Sk.
The current state Si, when the input character A, the automaton will uniquely transition to the state Sk, Sk is called a successor state of Si
; - Z⊆S: the end state set (acceptable state set, end state set) where each element is called the end state (acceptable state, end state) and Z can be null.
Let me summarize:
We can actually compare it with the formal definition above, and we will find some important points:
- Initial state unique
- The state transition function is a single valued function
- The termination state set Z may be null
For example
M=({S,U,V,Q}, {a, b}, f, S, {Q}), where f is defined as: f(S, a)=U f(S, b)=V f(U, a)=Q f(U, b)=V f(V, a)=U f(V, b)=Q f(Q, a)=Q f(Q, b)=QCopy the code
So we can draw the corresponding automatic state machine
The string accepted by the DFA
-
For any string T in the sum, t is said to be acceptable to DFA M if there is a path from the initial node to some terminating node, and the string of concatenated tokens on all the arc along this path is equal to T
-
An empty string is acceptable to DFA M if the initial state of DFA M is also the end state.
-
The total number of strings that DFA M can accept is L(M)
Again, for example, if an automaton looks like this:
Then: L(M1) = {aba, abaa, abab, abaab… } If you have found that the prototype of regular matching has 🌟
Certainty of DFA
So why do we say DFA is deterministic finite automata, deterministic where these two fonts are now?
- Initial state unique
- The state transition function F: S×∑→S is a single valued function, that is, for any state S ∈S, the input symbol A ∈∑, f(S, a) uniquely determines the next state.
How is DFA implemented in code
For example, the automata is shown in the figure:
In fact, we want to implement simple DFA automata can be implemented with swicth case
// Simply write a switch
switch (currentChar) {
case 'a':
goto Lj;
break;
case 'b':
goto Lk;
break;
default:
throw err;
}
Copy the code
NFA — non-deterministic finite automata
define
A nondeterministic finite automaton M is a quintuple M = (S, ∑, S0, f, Z)
- S: a finite set of states, each element of which is called a state;
- ∑: a finite alphabet in which each element is called an input character;
- S0 ⊆ S: non-empty initial state set;
- Z⊆S: termination state set;
- F: a state transition function that maps S×(∑∪{ε}) to a subset of S, that is, S×(∑∪{ε})→ 2^S
Note that the successor state here is not a single state, but a subset of the state set S, i.e. the transition function is not single valued.
The difference is that the result of the transition function is no longer a single value, the start state is a set instead of a definite value, and the input to the transition function is ∑∪{ε} which means that the label above the directed arc can be empty.
For example
NFA M = ({0.1.2}, {a, b}, f, {0}, {2}) State transition function is as follows:0,a)={0.1} f(1, a) = ∅ f (2,a)={2}
f(0,b)={0} f(1, epsilon) = {2} f(2,b)={2}
Copy the code
Then we can draw the corresponding finite automata
String accepted by the NFA
Let M be an NFA, M=(S, ∑, f, S0, Z), then L(M) be defined as the set of strings accepted from any initial state to any termination state. Let’s take the above automaton as an example. Automata to accept the above string collection is: L (M) = {beta | beta is like… a… A string consisting of a and b
For example: AAA, bab, abaa…
DFA was compared with NFA
So before we close this chapter, let’s review
The DFA:
- Start state unique
- The state transition function is a single – valued function
The NFA:
- The start state is a collection of states
- The result of the state transition function is a set
- The mark above a directed arc may be empty
More extensions
Tip: There will be a lot of content worth talking about here, such as NFA to DFA conversion, DFA simplification, etc., but because the content of the article is limited, and is not related to the topic of this article, interested people can leave a message, we continue to open pit. And I encourage you to learn more about yourself.
Analysis of regularity principle
How a regular expression engine works has never been clearer! , this article also has a lot of content to understand you can also go to the onlookers, I learned a lot from this article, the summary is very good.
We have already explained DFA and NFA, i.e., deterministic finite automata and non-deterministic finite automata. Based on the preceding information, you can already relate regular engine to automata, and regular engine can be roughly divided into two categories: DFA regular engine and NFA regular engine.
The DFA engine
For example
Let’s go straight to a simple example:
The regular expression is a[db]c. The string to be matched is ABC
The reason we’re using ‘[]’ here is because we’re going to implement ‘[]’ in Chapter 3, a simple regex engine.
Let’s start matching:
I don’t know if you understand, but let me describe the process of comparison:
- The first time, the character A is compared to the regular expression A
- After a successful match, the character B compares both b and D in the expression
- If the match succeeds again, the character C is compared with the regular expression C
- The match is successful
The important thing to note here is that the second match is b compared to b and D, so it’s probably going to consume more memory.
The characteristics of
From the example above, we can see some features of the DFA regular engine:
The text main
: Executes in text order, thus ensuring the determinism of the DFA re engineRecord all currently available possibilities
: Just like the second match in the previous example, b and D are compared at the same time, so it consumes more memoryEach character is checked only once
: improves execution efficiency because there is no backtracking to repeat the matching processFunctions such as backreferencing cannot be used
: Because each character is checked only once and only the current comparison value is recorded, some functions such as backreferencing and glancing cannot be used
The NFA engine
For example
The example is just the example, convenient for you to compare:
The regular expression is a[db]c. The string to be matched is ABC
Let’s start matching:
The NFA engine will record the position of the character before matching, and then select one of the possible states for matching. If the match fails, it will backtrack and enter the other branches for matching.
The characteristics of
From the example above, we can see some features of the NFA regular engine:
- Expression dominance: Execute one part of the expression. If no match is made, the other part will continue to match until the expression is matched.
- It’s going to record a certain location: We see when executing to
[db]
, the NFA engine records the position of the character and selects one to match first. - A single character may be checked multiple times: We see when executing to
[db]
When comparingd
After finding a mismatch, the NFA engine takes another branch of the expressionb
, while the text positionback, and matches the character ‘b’ again. This is why NFA engines are non-deterministic, which brings another problem: they may not be as efficient as DFA engines. - Can achieve backreference and other functions: because of the backtracking this step, so can easily achieve some functions such as backreference!
contrast
Chapter summary
At the end of this chapter, you have obtained all the pre-knowledge 🌟. Now we will use this knowledge to implement a simple regular, which is the focus of this article. Let’s move on to the next chapter, 📖
Hand touch hand, take you to achieve a simple version of the regular engine
Preface chapter
Function description:
We need to provide a method, testReg, that takes two arguments, one is the string STR to be verified, and the other is the regular expression reg, which returns a Boolean value, i.e., match or not
Regular expression rules:
- This regular expression needs to start with
^
At the beginning,$
At the end[]
Represents a character in the set of matched characters,[]
After can meet*
or+
*
Means to match0
A or0
More than one+
Means to match1
A or1
More than one
// Examples of regular expressions^ [123]*mn[ab]+cd$
^a[ab]+$
...
Copy the code
Warehouse address: js-regular
Thinking analytical
We encountered a problem, we need to think first, and then code after we have the idea, to avoid repeated changes resulting in code confusion and time waste.
So, let’s first think about what our goal is. Since automata are the topic of this article, there’s no need to keep us in a trap. Our first thought must be to convert regular expressions into automata and then match our strings with the help of this regular-matching automata.
So how do we convert a regular expression into an automaton? Here’s how I thought about it:
Graph TD regular expression --> sequence of independent units with matching meaning --> regular matching automata
Let me explain briefly. I’ll break the problem into two parts. First I need to parse the string and then convert it into a sequence of separate units with matching meanings, the TOKEN sequence. What is a sequence of independent units with matching meanings?
Here’s an example:
If the regular expression is ^[123]+[a]*3$, the rest can be divided into three separate units:
[123] +
[a]*
3
But I certainly wouldn’t just split into three strings, I’d still be three objects with meaning (for the sake of automata generation), but that’s for later.
Then we are going to have to match the meaning of the independent unit sequence name (I really don’t 🐶) into automata, since we all say, “I’ll use object represents each individual unit, the most simple way is to add next to the object properties, next, of course, may be an array, store all of the possible branches.
And then we’ll write another method to make the automata run.
Ok, say and do, next we will enter the code step by step display and interpretation, please think with me.
Entry method – testReg
// the entry function
const testReg = (str, reg) = > {
if(! reg.startsWith(A '^') | |! reg.endsWith('$')) {// it's not a right reg string
throw Error('the format mismatch! ');
}
const generator = getGeneratorStart(reg);
return isMatch(str, generator);
//console.log(matchStructure)
}
Copy the code
The entry method is pretty straightforward. You can see that I take two arguments here. The first STR is the string to be matched, and the second reg is the regular expression.
First, I verify that a regular expression that does not begin with a ^ and end with a $is invalid and is illegal. We then call the getGeneratorStart method to get the start state of the automaton, and then call the isMatch method to make a match to the string.
Get the automaton method – getGeneratorStart
// use reg to get generator and return start Pattern Object
const getGeneratorStart = (reg) = > {
const regStr = reg.slice(1, reg.length - 1);
const patternObjList = getPatternObjList(regStr);
const generator = getGenerator(patternObjList);
return generator;
}
Copy the code
This is another very short and straightforward method. The first step is to take a cut of the regular expression and cut off the end (remove the beginning ^ and the end $), leaving only the part that really works. We then call two more methods, getPatternObjList and getGenerator. These two methods are consistent with what I expressed in the analysis of ideas:
getPatternObjList
: input isregStr
Is a regular expression string. The output isA sequence of independent units with matching meaning
getGenerator
: The input is the output of the previous step, i.eA sequence of independent units with matching meaning
, the output isThe initial state of an automaton
.
Get the unit sequence method – getPatternObjList
// change reg String to Pattern Ojects and return list
const getPatternObjList = (regStr) = > {
const len = regStr.length;
let patternObjlist = [];
let isInCollection = false;
let collection = []; // used to store current collection
for (let i = 0; i < len; i++) {
const char = regStr[i];
if(! isInCollection) {//
if(char ! ='[') {
// single char object
patternObjlist.push({
isCollection: false.pattern: [char],
next: []})}else {
// char === [ we need to change isInCollection to true
isInCollection = true; }}else {
if(char ! ='] ') {
collection.push(char);
} else {
// ] is the sign end of collection
isInCollection = false;
// collectionSign maybe * or +
let collectionSign = regStr[i + 1];
let collectionType = 'COMMON';
if( collectionSign && collectionTypes.includes(collectionSign) ){
collectionType = collectionSign
i++;
}
patternObjlist.push({
isCollection: true.pattern: collection,
collectionType,
next: [] }) collection = []; }}}return patternObjlist;
}
Copy the code
This method is a long one, but it’s just a string iterating over and over again. It actually looks simple, but it’s worth noting the data structure I converted from a sequence of individual units with matching meanings:
[]
The data structure corresponding to the collection
{
isCollection: Boolean.pattern: Array.collectionType: emun,
next: Array
}
Copy the code
- The data structure corresponding to a normal string
{
isCollection: Boolean.pattern: Array.next: Array
}
Copy the code
Among them
pattern
It stores all possible matches, like[123] +
The pattern is[1, 2, 3]
collectionType
Store is*
or+
orCOMMON
To facilitate the subsequent generation of automatic machine processing
Let me show you the input and output of the method:
Input: ^ [123]+[a]*3$output: [{isCollection: true.pattern: [ '1'.'2'.'3'].collectionType: '+'.next: []}, {isCollection: true.pattern: [ 'a'].collectionType: The '*'.next: []}, {isCollection: false.pattern: [ '3'].next: []}]Copy the code
The unit sequence is converted to an automaton method – getGenerator
// change pattern list to regular generator
const getGenerator = (patternObjList) = > {
patternObjList.push({
isEnd: true,})// the end signal of generator
let start = {
isStart: true.next:[]
}; // generator need a 'start' to start valid
const len = patternObjList.length;
start.next = getNext(patternObjList, -1);
for(let i = 0; i < len; i++ ){
const curPattern = patternObjList[i];
curPattern.next = getNext(patternObjList, i)
if(collectionTypes.includes(curPattern.collectionType)){ curPattern.next.push(curPattern); }}return start;
}
Copy the code
Let’s start by adding the start and end states to the array of values returned by the getPatternObjList method. We then initialize next for the initial state, then loop through the array to initialize next for each item in the array, thus concatenating the states of the automaton through the pointer stored in next.
Note: Here each item in the next array is a reference to an object in the patternObjList array. And finally, append yourself if collectionType is * or +. This type of node can be self-looping
Then we’ll look at one of the sub-methods, getNext, which I won’t do in a separate chapter because the two methods are closely related.
// get PatternObj's next
const getNext = (patternObjList, index) = > {
let next = [];
const len = patternObjList.length;
for(let i = index + 1; i < len; i++){
const nextPattern = patternObjList[i]
next.push(nextPattern)
if(nextPattern.collectionType ! =The '*') {// * need to handle, * is possible no
break; }}return next;
}
Copy the code
The key thing is to do *, because * means zero or more, so you keep going.
For example, in regular expressions such as a[b]*c, a may be followed by b or c by B
Finally we can look at the output of this automaton
Input: ^ [123]+[a]*3The $output:// There may be some problems with the output because of circular references, but you can also see through the structure
{
isStart: true.next: [{isCollection: true.pattern: [Array].collectionType: '+'.next: [Array]]}}Copy the code
Automata legend display
Input: ^[123]+[a]*3$
Verify the matching method – isMatch
// use generator to test string
const isMatch = (str, generator) = > {
if(generator.isStart){
// the start of recursive
for(const nextGen of generator.next){
if(isMatch(str, nextGen)) return true;
}
return false;
} else if(generator.isEnd){
// if generator is end but str is not end return false
return str.length ? false : true;
} else {
if(! str.length){return false;
}
if(! generator.pattern.includes(str[0]) {return false;
} else {
const restStr = str.slice(1);
for(const nextGen of generator.next){
if(isMatch(restStr, nextGen)) return true;
}
return false; }}}Copy the code
This is actually a recursive program:
- If the current state of the automaton is in the start state: No matching is performed, and cyclic matching is performed
next
If one of the branches matches successfully, it is a valid string - If the current state of the automaton is terminated: The judgment method passed in
str
Student: Is the length0
If it is0
It indicates that the string to be matched has been matched and is a valid string. Otherwise, it is invalid. - Other situations: matches whether the current character is in
pattern
Array, if in means that the current character matches, continue to loop matchingnext
If one of the branches matches successfully, it is a valid string
And so our code is done!
The output demonstration
Correct 🌟
The complete code
Convenient everyone copy paste or complete review, is very intimate ❤️
const collectionTypes = [The '*'.'+'];
// change reg String to Pattern Ojects and return list
const getPatternObjList = (regStr) = > {
const len = regStr.length;
let patternObjlist = [];
let isInCollection = false;
let collection = []; // used to store current collection
for (let i = 0; i < len; i++) {
const char = regStr[i];
if(! isInCollection) {//
if(char ! ='[') {
// single char object
patternObjlist.push({
isCollection: false.pattern: [char],
next: []})}else {
// char === [ we need to change isInCollection to true
isInCollection = true; }}else {
if(char ! ='] ') {
collection.push(char);
} else {
// ] is the sign end of collection
isInCollection = false;
// collectionSign maybe * or +
let collectionSign = regStr[i + 1];
let collectionType = 'COMMON';
if( collectionSign && collectionTypes.includes(collectionSign) ){
collectionType = collectionSign
i++;
}
patternObjlist.push({
isCollection: true.pattern: collection,
collectionType,
next: [] }) collection = []; }}}return patternObjlist;
}
// get PatternObj's next
const getNext = (patternObjList, index) = > {
let next = [];
const len = patternObjList.length;
for(let i = index + 1; i < len; i++){
const nextPattern = patternObjList[i]
next.push(nextPattern)
if(nextPattern.collectionType ! =The '*') {// * need to handle, * is possible no
break; }}return next;
}
// change pattern list to regular generator
const getGenerator = (patternObjList) = > {
patternObjList.push({
isEnd: true,})// the end signal of generator
let start = {
isStart: true.next:[]
}; // generator need a 'start' to start valid
const len = patternObjList.length;
start.next = getNext(patternObjList, -1);
for(let i = 0; i < len; i++ ){
const curPattern = patternObjList[i];
curPattern.next = getNext(patternObjList, i)
if(collectionTypes.includes(curPattern.collectionType)){ curPattern.next.push(curPattern); }}return start;
}
// use reg to get generator and return start Pattern Object
const getGeneratorStart = (reg) = > {
const regStr = reg.slice(1, reg.length - 1);
const patternObjList = getPatternObjList(regStr);
const generator = getGenerator(patternObjList);
return generator;
}
// use generator to test string
const isMatch = (str, generator) = > {
if(generator.isStart){
// the start of recursive
for(const nextGen of generator.next){
if(isMatch(str, nextGen)) return true;
}
return false;
} else if(generator.isEnd){
// if generator is end but str is not end return false
return str.length ? false : true;
} else {
if(! str.length){return false;
}
if(! generator.pattern.includes(str[0]) {return false;
} else {
const restStr = str.slice(1);
for(const nextGen of generator.next){
if(isMatch(restStr, nextGen)) return true;
}
return false; }}}// the entry function
const testReg = (str, reg) = > {
if(! reg.startsWith(A '^') | |! reg.endsWith('$')) {// it's not a right reg string
throw Error('the format mismatch! ');
}
const generator = getGeneratorStart(reg);
return isMatch(str, generator);
//console.log(matchStructure)
}
console.log(testReg('2131aa3'.'^[123]+[a]*3$'));
Copy the code
Chapter summary
In this chapter we use the knowledge we learned in the previous chapters to implement a simple regex 🌟, while the real regex engine is much more complex, and there are also processes such as precompilation that I haven’t seen yet. But articles brought in practice or in personal, I believe you and I together to finish the simple regular match after also can get some ideas to solve the problem, or a little more thinking, thank you to join me in this process, the experience may point ah 👍, or focus on ➕ give me more power, learning to grow together with you.
conclusion
How the regular expression engine works — it’s never been clearer! Thank you for your sharing. Thanks to my Alma mater Jilin University courseware ~ thanks to the author luo Zhu for sharing some special content 🐶 ~ thanks to the operation sister girl knight’s pillow, let me combat full ~
Finally, I am cold grass, a grass code ape 🐒, everyone like my article may wish to pay attention to ➕, praise 👍. Your support is my biggest, biggest, biggest motivation
The universe is uncertain, you and I are dark horse 🔥 scallion duck 🦆