While learning Lua recently, I discovered that Lua has no built-in regular expression functions. Instead, it provides its own regular-expression-like scheme, pattern matching, to support similar functionality. The main reason is program size: a typical POSIX-compliant regular expression implementation takes more than 4,000 lines of code, which is larger than all of Lua's standard libraries together, whereas Lua's pattern matching is implemented in fewer than 500 lines. Pattern matching is not as powerful as regular expressions, but it is good enough for most scenarios. The trade-off is a kind of "space for time" exchange, where the time is the programmer's learning time. This article uses some examples to help you understand pattern matching in Lua.

Quick preview

The following example program extracts the date portion from the string s and prints it:

s = "Deadline is 30/05/1999, firm"
date = "%d%d/%d%d/%d%d%d%d"
print(string.match(s, date))   --> 30/05/1999

The string date is the "pattern" used for matching. %d matches a single digit, so the pattern tells string.match to look for a date in the string s in the format dd/mm/yyyy.

Pattern composition

Character classes

Items like %d are called character classes in Lua. The available classes are:

character class    matches
%a    Letters
%c    Control characters (e.g. the line feed \n)
%d    Digits 0-9
%l    Lowercase letters
%p    Punctuation characters
%s    White space characters
%u    Uppercase letters
%w    Letters and digits (alphanumeric characters)
%x    Hexadecimal digits
%z    The null character \0
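To see a few of these classes in action, here is a small sketch that tests a single character against a class with string.find. The helper name is_class is made up for this illustration and is not part of Lua; the results assume the default C locale.

```lua
-- Hypothetical helper (not part of Lua): true if the string s consists of
-- exactly one character that belongs to the given character class.
local function is_class(s, class)
  return string.find(s, "^" .. class .. "$") ~= nil
end

print(is_class("7", "%d"))   --> true
print(is_class("F", "%x"))   --> true
print(is_class("g", "%x"))   --> false  (g is not a hexadecimal digit)
print(is_class("\t", "%s"))  --> true   (tab is white space)
print(is_class("_", "%w"))   --> false  (underscore is not alphanumeric)
print(is_class("_", "%p"))   --> true   (underscore counts as punctuation)
```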

The uppercase form of a character class stands for the complement of that class. For example, %A represents all non-letter characters:

s, _ = string.gsub("hello, up-down!", "%A", ".")
print(s) --> hello..up.down.

Here string.gsub matches every non-letter character (%A) in "hello, up-down!" and replaces it with ".".
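The same idea applies to the other classes. As a small sketch (the input strings are made up for illustration), %D and %S are the complements of %d and %s:

```lua
-- %D is the complement of %d: strip everything that is not a digit.
-- (string.gsub also returns a count; assigning to one variable drops it.)
local digits = string.gsub("tel: (555) 123-4567", "%D", "")
print(digits)  --> 5551234567

-- %S is the complement of %s: grab the first run of non-space characters.
print(string.match("   padded text", "%S+"))  --> padded
```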

Magic characters

So far, a character class can only match a single character, which is very limiting. Lua therefore also provides a regular-expression-like syntax that allows more complex matching through special characters called magic characters. They are listed below:

magic character    description
()    Marks a subpattern (a capture) for later use, similar to groups in regular expressions
.     Matches any single character. Similar to regular expressions, except that the regular expression . usually does not match the newline "\n"
%     1. Acts as the escape character; 2. Introduces character classes; 3. Combined with () for subpattern matching: %N, where N is a digit, matches the Nth captured substring, e.g. %1, %2, ..., similar to \1, \2, ... in regular expressions
+     Matches the previous item 1 or more times
-     Matches the previous item 0 or more times and returns the shortest possible match (the "non-greedy mode" of pattern matching)
*     Matches the previous item 0 or more times
?     Matches the previous item 0 or 1 times
^     At the beginning of a pattern, anchors the match at the start of the input string; at the beginning of a [] set, takes the complement of the set. Basically the same as in regular expressions
$     At the end of a pattern, anchors the match at the end of the input string, similar to regular expressions
[]    Defines a set of characters (char-set), similar to regular expressions. For example, [%w_] matches letters, digits, and underscores, and [a-f] matches the letters a to f
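Because these characters are special, a string that should be matched literally has to be escaped with % first. The following is a minimal sketch; escape_pattern is a hypothetical helper written for this article, relying on the rule that % followed by any non-alphanumeric character stands for that character itself.

```lua
-- Hypothetical helper (not part of Lua): prefix every punctuation character
-- with % so the text can be used as a literal inside a pattern.
local function escape_pattern(text)
  return (string.gsub(text, "%p", "%%%0"))
end

local needle = "1+1=2"
print(escape_pattern(needle))                                 --> 1%+1%=2
print(string.find("so 1+1=2 holds", escape_pattern(needle)))  --> 4   8
-- Unescaped, + means "one or more of the previous item", so nothing is found:
print(string.find("so 1+1=2 holds", needle))                  --> nil
-- For a plain substring search, the plain argument of string.find also works:
print(string.find("so 1+1=2 holds", needle, 1, true))         --> 4   8
```

The parentheses around string.gsub in the helper discard its second return value, the substitution count.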

Examples

In Lua, the main functions that use pattern matching are:

function signature    basic purpose
string.find(string, pattern [, init, plain])    Looks for the first match of pattern in the string and returns its start and end positions
string.match(string, pattern [, init])    Returns the first substring of the string that matches pattern
string.gmatch(string, pattern)    Returns an iterator function that returns the next match each time it is called
string.gsub(string, pattern, repl [, n])    Replaces all substrings matched by pattern according to repl and returns the resulting string (followed by the number of substitutions made)

Parameters in [] are optional. Please refer to the official documentation for detailed function descriptions.

www.lua.org/manual/5.1/…
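As a quick orientation before the detailed examples, here is a small sketch of what each of the four functions returns for the same input. The sample string is made up for illustration.

```lua
local s = "key = value"

-- string.find: start and end positions, followed by any captures
print(string.find(s, "(%w+) = (%w+)"))   --> 1   11   key   value
-- string.match: the whole match (or the captures, if the pattern has any)
print(string.match(s, "%w+"))            --> key
-- init: start searching at position 5
print(string.match(s, "%w+", 5))         --> value
-- string.gmatch: one match per loop iteration
local words = {}
for word in string.gmatch(s, "%w+") do
  words[#words + 1] = word
end
print(table.concat(words, ","))          --> key,value
-- string.gsub: the new string and the number of substitutions
print(string.gsub(s, "%w+", "X"))        --> X = X   2
```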

We’ll walk through several sample programs to demonstrate how pattern matching can be used.

Basic example

Matching multiple times

-- + matches 1 or more times
print(string.match("hello world 123", "%w+ %w+")) --> hello world
print(string.match("hello world 123", "[%w %d]+")) --> hello world 123
-- ? matches 0 or 1 time (here: an optionally signed number)
print(string.match("the number is: +123", "[+-]?%d+")) --> +123
print(string.match("the number is: 123", "[+-]?%d+")) --> 123
print(string.match("the number is: -123", "[+-]?%d+")) --> -123
-- * matches 0 or more times
print(string.match("abc123abc", "%a+%d*%a+")) --> abc123abc
print(string.match("abcabc", "%a+%d*%a+")) --> abcabc

Using [] sets

-- Matches alphanumeric characters and "."
print(string.match("a1.a2+a3", "[%w.]+")) --> a1.a2
-- [0-9]+ is equivalent to %d+
print(string.match("abc123", "[0-9]+")) --> 123
-- [a-fA-F0-9]+ is equivalent to %x+
print(string.match("abc123", "[a-fA-F0-9]+")) --> abc123
-- Matches runs of 0 and 1
print(string.match("1010345", "[01]+")) --> 1010
-- ^ inverts the set: match all characters except letters
print(string.match("abc123", "[^%a]+")) --> 123

Using () and %N to capture subpatterns

The following example uses () to capture subpatterns and returns the text matched by each one:

date = "17/7/1990"
_, _, d, m, y = string.find(date, "(%d+)/(%d+)/(%d+)")
print(d, m, y)  --> 17   7   1990

The example is easy to follow: the pattern defines three subpatterns with (), and string.find returns the text matched by each subpattern as extra return values after the start and end positions. Combined with %N, a captured value can also be used inside the pattern itself. For example, the following extracts the part of a string enclosed in double quotes (") or single quotes ('):

s = [[then he said: "it's all right"!]]
a, b, c, quotedPart = string.find(s, "([\"'])(.-)%1")
print(quotedPart)   --> it's all right
print(c)            --> "

The pattern ([\"'])(.-)%1 contains two subpatterns. %1 stands for whatever the first subpattern ([\"']) matched (return value c), in this case the double quote ". A simpler pattern like [\"'].-[\"'] would not work here, because it would stop at the first quote character of either kind and match only "it'.
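The same back-reference technique works for any delimiter that has to repeat. As a minimal sketch (the tag strings are invented for illustration), %1 can require that a closing tag carries the same name as the opening tag:

```lua
-- (%w+) captures the tag name; %1 demands the same name in the closing tag.
local pattern = "<(%w+)>(.-)</%1>"

print(string.match("<b>bold</b> and <i>italic</i>", pattern))  --> b   bold
-- Mismatched tags: %1 is "b", but the closing tag is </i>, so nothing matches.
print(string.match("<b>bold</i>", pattern))                    --> nil
```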

Non-greedy matching

+, * and ? are greedy, like their counterparts in regular expressions: they always match as many characters as possible. Regular expressions use ? after a quantifier to make it non-greedy. In Lua, ? has no such role; instead, - is used as the non-greedy counterpart of *, for example:

print(string.match("<span>hello</span><span>world</span>", "<span>.+</span>"))
--> <span>hello</span><span>world</span>
print(string.match("<span>hello</span><span>world</span>", "<span>.-</span>"))
--> <span>hello</span>

We want to match the content of an HTML span element. In the first pattern, .+ is greedy, so the match runs all the way to the last </span> and covers <span>hello</span><span>world</span>. In the second pattern, .- matches as few characters as possible, which is called non-greedy (lazy) mode, so the match stops at the first </span>. The difference is easy to observe with string.gsub:

-- Replace each <span>...</span> element with <span>lua</span>
-- Greedy match: the entire string is replaced
-- (print also shows gsub's second result, the number of substitutions)
print(string.gsub("<span>hello</span><span>world</span>", "<span>.+</span>", "<span>lua</span>"))
--> <span>lua</span>   1

-- Non-greedy match: the result is as expected
print(string.gsub("<span>hello</span><span>world</span>", "<span>.-</span>", "<span>lua</span>"))
--> <span>lua</span><span>lua</span>   2

Note that a - item at the end of a pattern is pointless: - matches as few characters as possible, and with nothing after it to force it to expand, it always matches zero characters.
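How far a - item expands depends entirely on what follows it in the pattern, which can be checked directly. A small sketch with a made-up input string:

```lua
-- With nothing after it, a "-" item is satisfied by zero characters:
print(string.match("hello123", "%a-") == "")  --> true
-- With something after it, it expands just far enough for the rest to match:
print(string.match("hello123", "%a-%d"))      --> hello1
-- The greedy counterpart takes the whole run of letters either way:
print(string.match("hello123", "%a*"))        --> hello
```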

string.gmatch

string.match returns only the first substring that matches the pattern. If we want all substrings that match the pattern, we can use string.gmatch:

for i in string.gmatch("<span>hello</span><span>world</span>", "<span>(.-)</span>") do
    print(i)
end

Output result:

hello
world

Here .- is wrapped in (), as (.-), so the iterator returned by string.gmatch yields the captured subpattern directly. If the pattern without the capture, <span>.-</span>, is used instead, the output is:

<span>hello</span>
<span>world</span>

This is a feature of gmatch: if the pattern contains captures (), the iterator yields only the substrings matched by the captures; otherwise it yields each whole match. Using this feature, we can also read key-value pairs of the form key=value:

s = "from=world, to=Lua"
for k, v in string.gmatch(s, "(%w+)=(%w+)") do
    print("key="..k)
    print("value="..v)
end

Output result:

key=from
value=world
key=to
value=Lua
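The key-value loop is often more useful when it fills a table instead of printing. A minimal sketch, where parse_pairs is a hypothetical helper name chosen for this example:

```lua
-- Hypothetical helper (not part of Lua): collect "key=value" pairs from s
-- into a table, using the two captures that string.gmatch yields per match.
local function parse_pairs(s)
  local t = {}
  for k, v in string.gmatch(s, "(%w+)=(%w+)") do
    t[k] = v
  end
  return t
end

local t = parse_pairs("from=world, to=Lua")
print(t.from, t.to)  --> world   Lua
```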

string.gsub

string.gsub matches and replaces: it replaces all substrings that match pattern (or only the first n, if the optional fourth argument n is given) according to repl, where repl can be a string, a function, or a table:

x = string.gsub("hello world", "(%w+)", "%1 %1")
--> x="hello hello world world"
-- %1 is the string captured by (%w+): "hello" for the first match, "world" for the second.

x = string.gsub("hello world", "%w+", "%0 %0", 1)
--> x="hello hello world"
-- The fourth argument limits the replacement to the first match, "hello".
-- For the first match, "%0 %0" expands to "hello hello",
-- so the result is "hello hello world".

x = string.gsub("hello world", "%w+", "%0 %0", 2)
--> x="hello hello world world"
-- The fourth argument limits the replacement to the first two matches, "hello" and "world".
-- For the first match, "%0 %0" expands to "hello hello".
-- For the second match, "%0 %0" expands to "world world",
-- so the result is "hello hello world world".

x = string.gsub("hello world from Lua", "(%w+)%s*(%w+)", "%2 %1")
--> x="world hello Lua from"

x = string.gsub("home = $HOME, user = $USER", "%$(%w+)", os.getenv)
--> x="home = /home/roberto, user = roberto"
-- The % in %$ escapes $ so that it matches a literal "$".
-- When repl is a function, it is called on every match with the subpatterns captured by () as arguments,
-- and its return value is used as the replacement string.
-- In this example, os.getenv("HOME") and os.getenv("USER") are called, and each match is replaced with the returned value.

-- loadstring is the Lua 5.1 function; Lua 5.2 and later use load instead
x = string.gsub("4+5 = $return 4+5$", "%$(.-)%$", function (s)
     return loadstring(s)()
   end)
--> x="4+5 = 9"

local t = {name="lua", version="5.1"}
x = string.gsub("$name-$version.tar.gz", "%$(%w+)", t)
--> x="lua-5.1.tar.gz"
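A few more cases may help to round out the three forms of repl. The sketch below uses made-up inputs; trim is a helper name chosen here, and the behaviour shown for missing table keys and nil return values follows the Lua 5.1 manual (a false or nil result means no replacement).

```lua
-- String repl with a capture: trim leading and trailing white space.
local function trim(s)
  return (string.gsub(s, "^%s*(.-)%s*$", "%1"))
end
print("[" .. trim("  hello world  ") .. "]")  --> [hello world]

-- Table repl: a key that is missing from the table leaves the match unchanged.
local vars = { name = "lua" }
print((string.gsub("$name-$version", "%$(%w+)", vars)))  --> lua-$version

-- Function repl: returning nil (or false) also keeps the original match.
local function mark_long(d)
  if #d > 1 then return "<" .. d .. ">" end
end
print((string.gsub("a1b22c", "%d+", mark_long)))  --> a1b<22>c
```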

How string.gsub executes

string.gsub can be used in many ways, but they all follow the same principle. Once you understand how it works, the examples above are easy to follow. The execution of string.gsub is a cycle of pattern matching followed by string replacement, repeated until pattern matching stops. In pseudocode:

repeat
    perform pattern matching
    replace the string matched in this iteration according to the repl rule
until (pattern matching ends)

If we do not pass the fourth argument n, string.gsub by default processes every match in the string. If we do not need every match (string.match and string.find, for instance, stop after the first result), we can limit the number of matches with n, which determines when pattern matching ends. In each iteration of the match-replace loop, repl sees only the result of the current match: in the second iteration, for example, %N refers to a substring captured by the second match, and if repl is a function, its arguments are the captures of the second match and its return value replaces only the second matched string.
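The loop can be spelled out in Lua itself. The following is only an illustrative sketch built on string.find, not Lua's actual implementation: simple_gsub is a hypothetical name, repl must be a function, and empty matches and "^" anchors are deliberately not handled.

```lua
local unpack = table.unpack or unpack  -- Lua 5.2+ / Lua 5.1

-- Hypothetical, simplified sketch of the match-replace loop.
local function simple_gsub(s, pattern, repl, n)
  local out, pos, count = {}, 1, 0
  while not n or count < n do
    -- string.find returns start, end, then the captures (if any)
    local r = { string.find(s, pattern, pos) }
    local first, last = r[1], r[2]
    if not first or last < first then break end    -- pattern matching ends
    out[#out + 1] = string.sub(s, pos, first - 1)  -- text before the match
    local whole = string.sub(s, first, last)
    -- repl sees only the current match: its captures, or the whole match
    local new
    if #r > 2 then new = repl(unpack(r, 3)) else new = repl(whole) end
    out[#out + 1] = new or whole                   -- nil/false keeps the match
    pos, count = last + 1, count + 1
  end
  out[#out + 1] = string.sub(s, pos)               -- remaining text
  return table.concat(out), count
end

print(simple_gsub("hello world", "%w+", string.upper))     --> HELLO WORLD   2
print(simple_gsub("hello world", "%w+", string.upper, 1))  --> HELLO world   1
print(simple_gsub("a=1, b=2", "(%w+)=(%w+)", function (k, v)
  return v .. "=" .. k
end))                                                      --> 1=a, 2=b   2
```

Each pass through the while loop corresponds to one iteration of the repeat/until pseudocode, and the optional n simply ends the loop early.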

References

www.lua.org/manual/5.1/…
www.lua.org/pil/20.1.ht…
www.fhug.org.uk/wiki/wiki/d…
www.jianshu.com/p/f141027e1…