By Amit Chaudhary
In NLP, examining text against patterns or extracting content from text that matches a particular pattern is a common task. Regular expressions are a powerful tool for doing this.
Natural Language Processing (NLP) is a sub-field of artificial intelligence (AI).
Despite their power, regular expressions are often intimidating: there are many symbols to remember, and complex patterns take careful logical reasoning to read.
In this article, we illustrate the concepts of regular expressions mainly through diagrams. The goal is to help you, and me, build a mental model of regular expressions.
Mental models
Let’s start with a simple example where we try to find the word cool in the text.
With a regular expression, we just type the word ‘cool’ as the pattern and it matches that word.
'cool'
Although the regular expression matches the word ‘cool’ as expected, it operates at the character level rather than the word level, which is a point we need to be clear about.
Note: Regular expressions work at the character level, not the word level.
This means that the regular expression ‘cool’ will also match those four characters wherever they appear, including inside a longer word.
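A minimal sketch of character-level matching, using Python's re module (which this article introduces in a later section). The sample sentence is an assumption chosen for illustration; it is not taken from the original diagrams.

```python
import re

# Hypothetical sample sentence, chosen only to illustrate the point.
text = "The cooler kept the drinks cool."

# findall returns every non-overlapping occurrence of the characters c-o-o-l.
matches = re.findall(r"cool", text)
print(matches)  # ['cool', 'cool']

# The start offsets show that the first hit sits inside the word "cooler".
positions = [m.start() for m in re.finditer(r"cool", text)]
print(positions)  # [4, 27]
```

The pattern has no notion of word boundaries, so "cooler" contributes a match just as the standalone word does.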
Basic building blocks
Now that we understand the key points, let’s look at how to use regular expressions to match simple characters.
Specific character
We can write a literal character in a regular expression, and it will match every instance of that character in the text.
For example, the following regular expression will match all instances of ‘a’ in the text:
'a'
You can also use any digit from 0 to 9 to match that digit.
'3'
Note that regular expressions are case sensitive by default, so the following regular expression does not match a lowercase ‘a’.
'A'
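The literal-character patterns above can be tried out in Python as a rough sketch. The sample text is an assumption, and the re.IGNORECASE flag is not covered in the original article; it is shown only to make the "by default" in the note concrete.

```python
import re

# Hypothetical sample text.
text = "An apple a day"

lower = re.findall(r"a", text)  # lowercase 'a' only
upper = re.findall(r"A", text)  # uppercase 'A' only
print(lower)  # ['a', 'a', 'a']
print(upper)  # ['A']

# Assumption beyond the article: re.IGNORECASE switches off case sensitivity.
both = re.findall(r"a", text, flags=re.IGNORECASE)
print(both)  # ['A', 'a', 'a', 'a']
```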
Whitespace characters
We can use special escape sequences to match special characters, such as spaces and newlines.
In addition to the common ones mentioned above, we have:
\r  carriage return
\f  form feed
\e  escape (supported by some regex engines, such as PCRE, but not by Python's re module)
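As a small sketch, the escape sequences that Python's re module accepts can be checked against a sample string. The string is an assumption made up for illustration.

```python
import re

# Hypothetical text containing a tab, a space, a newline and a carriage return.
text = "col1\tcol2\nrow 2\r\n"

tabs = re.findall(r"\t", text)
newlines = re.findall(r"\n", text)
carriage_returns = re.findall(r"\r", text)
spaces = re.findall(r" ", text)

# Print the counts, since the matched characters themselves are invisible.
print(len(tabs), len(newlines), len(carriage_returns), len(spaces))  # 1 2 1 1
```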
Special characters
Regular expressions provide several built-in special characters, each starting with a backslash \, that match a whole class of characters at once.
Pattern: \d
It matches any single digit from 0 to 9.
Notice that each match is a single digit. So in the text 18.04, instead of one match for the whole number, we get four separate matches: 1, 8, 0, and 4.
Pattern: \s
It matches any whitespace character (space, tab, or newline).
Pattern: \w
It matches any lowercase letter (a to z), uppercase letter (A to Z), digit (0 to 9), or underscore.
Pattern: .
It matches any character except newline (\n).
let str = 'line 1\nline2'
str.match(/./g)
// result: ["l", "i", "n", "e", " ", "1", "l", "i", "n", "e", "2"]
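The same four patterns can be sketched in Python with re.findall. The sample strings are assumptions, except the last one, which mirrors the snippet above.

```python
import re

# Hypothetical sample text containing digits, a space and a newline.
text = "Ubuntu 18.04\nLTS"

digits = re.findall(r"\d", text)           # each digit is a separate match
whitespace = re.findall(r"\s", text)       # the space and the newline
word_chars = re.findall(r"\w", "a_1!")     # letters, digits and underscore
any_char = re.findall(r".", "line 1\nline2")  # everything except the newline

print(digits)      # ['1', '8', '0', '4']
print(whitespace)  # [' ', '\n']
print(word_chars)  # ['a', '_', '1']
print(any_char)    # ['l', 'i', 'n', 'e', ' ', '1', 'l', 'i', 'n', 'e', '2']
```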
Pattern: Negation
If we use the uppercase form of any of the patterns above, it matches the opposite set of characters.
For example, while \d matches any digit from 0 to 9, \D matches any character that is not a digit.
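A short sketch of the negated classes in Python; the sample text is an assumption.

```python
import re

# Hypothetical sample text.
text = "Room 42!"

non_digits = re.findall(r"\D", text)      # everything except 4 and 2
non_whitespace = re.findall(r"\S", text)  # everything except the space
non_word = re.findall(r"\W", text)        # the space and the exclamation mark

print(non_digits)      # ['R', 'o', 'o', 'm', ' ', '!']
print(non_whitespace)  # ['R', 'o', 'o', 'm', '4', '2', '!']
print(non_word)        # [' ', '!']
```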
Character set
Character set patterns start with [, end with ], and match any one of the characters enclosed in the brackets. For example, the pattern [aeiou] matches any of the characters ‘a’, ‘e’, ‘i’, ‘o’, and ‘u’.
We can also use the pattern [0123456789] to replace the functionality of \d.
Instead of listing every digit, we can use a hyphen and specify only the start and end of the range. Therefore, we can use [0-9] instead of [0123456789].
For example, [2-4] can be used to match any digit between 2 and 4 (that is, 2 or 3 or 4).
We can also use the special characters described above inside the brackets. For example, [\d\s] matches any digit from 0 to 9 or any whitespace character.
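The character sets described above can be sketched in Python as follows. All sample strings are assumptions for illustration.

```python
import re

vowels = re.findall(r"[aeiou]", "regular expression")
digits = re.findall(r"[0-9]", "a1b22")          # same effect as \d here
two_to_four = re.findall(r"[2-4]", "12345")     # only 2, 3 and 4
digit_or_space = re.findall(r"[\d\s]", "a 1\tb")  # classes work inside a set

print(vowels)          # ['e', 'u', 'a', 'e', 'e', 'i', 'o']
print(digits)          # ['1', '2', '2']
print(two_to_four)     # ['2', '3', '4']
print(digit_or_space)  # [' ', '1', '\t']
```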
Below, some common patterns and their meanings are listed.
Anchors
Regular expressions also have special characters, called anchors, that make a pattern match only at the beginning or end of a string.
We can use the ^ character before a pattern so that it matches only at the start of the string. For example, ^a matches an ‘a’ only when it is the first character.
Similarly, we can use the $ character after a pattern so that it matches only at the end of the string. For example, a$ matches an ‘a’ only when it is the last character.
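A brief sketch of both anchors in Python. The sample sentences are assumptions.

```python
import re

# "the" occurs twice, but ^ restricts the match to the start of the string.
at_start = re.findall(r"^the", "the cat and the hat")

# "hat" occurs twice, but $ restricts the match to the end of the string.
at_end = re.findall(r"hat$", "the hat is a hat")

# "cat" is present but not at the start, so nothing is found.
not_at_start = re.search(r"^cat", "the cat")

print(at_start)      # ['the']
print(at_end)        # ['hat']
print(not_at_start)  # None
```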
Escape metacharacters
Consider a case where we want to match the word “Mr. Stark” exactly.
If we use Mr. Stark as the pattern exactly as written, it has an unexpected effect, because the dot has a special meaning in regular expressions: it matches any character, not only a literal period.
Therefore, if we want to match such a character itself, we need to escape special metacharacters such as . and $ with a backslash.
Below is the list of metacharacters; remember to escape them when you want to match them literally.
^ $ . * + ? { } [ ] \ | ( )
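The effect of the unescaped dot can be sketched in Python. The sample text is an assumption, and re.escape is a standard-library helper that the original article does not mention; it is shown as one possible way to escape a literal string automatically.

```python
import re

# Hypothetical text: the second name has an "X" where the period would be.
text = "Mr. Stark and MrX Stark"

# Unescaped dot: matches any character, so both names are found.
unescaped = re.findall(r"Mr. Stark", text)

# Escaped dot: matches only a literal period.
escaped = re.findall(r"Mr\. Stark", text)

# Assumption beyond the article: re.escape adds the backslashes for us.
auto_escaped = re.findall(re.escape("Mr. Stark"), text)

print(unescaped)     # ['Mr. Stark', 'MrX Stark']
print(escaped)       # ['Mr. Stark']
print(auto_escaped)  # ['Mr. Stark']
```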
Repetition
Now that we can pattern match any character, let’s move on to a slightly more complex pattern.
The naive way to match repeated characters
Using only what has been learned so far, the naive approach is to repeat the pattern. For example, we can match two digits by repeating the character-level pattern.
\d\d
Quantifiers
Regular expressions provide special quantifiers that specify how many times the character or pattern preceding them should repeat.
Fixed repetition
We can use the {…} quantifier to specify how many times the pattern should be repeated.
For example, the pattern previously used to match two digits can be rewritten as \d{2}.
We can also specify a repetition range using the same quantifier. For example, to match from two to four digits, we can use the pattern \d{2,4}.
When applied to a sentence, it matches runs of both two digits and four digits.
Note that there must not be any spaces between the minimum and maximum counts; for example, \d{2, 4} does not work.
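A rough sketch of the fixed and ranged quantifiers in Python. The sample text is an assumption. The last call shows what Python's re module does with the spaced form: it reads the braces as literal characters, so nothing in this text matches.

```python
import re

# Hypothetical text with digit runs of length 1, 2, 4 and 6.
text = "7 42 2021 123456"

exactly_two = re.findall(r"\d{2}", text)    # longer runs are split into pairs
two_to_four = re.findall(r"\d{2,4}", text)  # greedy: takes up to four at a time
with_space = re.findall(r"\d{2, 4}", text)  # the space breaks the quantifier

print(exactly_two)  # ['42', '20', '21', '12', '34', '56']
print(two_to_four)  # ['42', '2021', '1234', '56']
print(with_space)   # []
```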
Flexible quantifiers
Regular expressions also provide the quantifiers *, +, and ? to specify flexible repetition of the preceding character.
?
The ? quantifier matches zero or one occurrence of the preceding character.
For example, suppose we want to match both “sound” and “sounds”, where the “s” is optional. We can use the ? quantifier: sounds?
+
The + quantifier matches one or more occurrences of the preceding character.
For example, we can use the pattern \d+ to find numbers of arbitrary length.
*
The * quantifier matches zero or more occurrences of the preceding character.
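The three flexible quantifiers can be sketched in Python as follows. The sample strings, and the ab* pattern used for the * case, are assumptions for illustration.

```python
import re

# ?: the trailing "s" is optional.
optional_s = re.findall(r"sounds?", "One sound, two sounds")

# +: one or more digits, so each number is captured whole.
numbers = re.findall(r"\d+", "Call 911 or 5551234 by 2 pm")

# *: an "a" followed by zero or more "b" characters.
a_then_bs = re.findall(r"ab*", "a ab abbb b")

print(optional_s)  # ['sound', 'sounds']
print(numbers)     # ['911', '5551234', '2']
print(a_then_bs)   # ['a', 'ab', 'abbb']
```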
Usage in Python
Python provides a module in the standard library called “re” to use regular expressions.
The need for raw strings
To specify regular expressions in Python, we use a raw string, which is written by prefixing the string literal with r.
pattern = r'\d'
To understand why we add r in front, let's try printing the expression \t without the r prefix.
>>> pattern = '\t'
>>> print(pattern)
As you can see, when we do not use a raw string, Python treats \t as the escape sequence for a tab character, so only invisible whitespace is printed.
Now we convert it to a raw string, and we get exactly what we typed.
>>> pattern = r'\t'
>>> print(pattern)
\t
Using the re module
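The difference between the two literals can also be checked by length, as in this small sketch. The variable names and the sample string are assumptions.

```python
import re

plain = "\t"  # one character: an actual tab
raw = r"\t"   # two characters: a backslash and the letter t

print(len(plain), len(raw))  # 1 2

# A raw string is equivalent to doubling every backslash by hand.
print(r"\d" == "\\d")  # True

# The regex engine itself understands the two-character \t as "tab",
# so both spellings find the tab in this sample text.
sample = "a\tb"
print(re.findall(plain, sample) == re.findall(raw, sample))  # True
```

Raw strings matter most for sequences such as \d, which Python's string syntax does not define but the regex engine does.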
To use the re module, we need to import it:
import re
1. re.findall
This function allows us to get all matches as a list of strings.
import re
re.findall(r'\d', '123456')
# ['1', '2', '3', '4', '5', '6']
2. re.match
This function searches for patterns at the beginning of the string and returns the first match as a match object. If the pattern is not found, None is returned.
import re
match = re.match(r'batman', 'batman is cool')
print(match)
# <re.Match object; span=(0, 6), match='batman'>
With the match object, we can get the matched text by calling group():
print(match.group())
# batman
In cases where our pattern is not at the beginning of the sentence, we will not get any matches.
import re
match = re.match(r'batman', 'The batman is cool')
print(match)
# None
3. re.search
This function can also look for the first occurrence of a pattern, but the pattern can appear anywhere in the text. If the pattern is not found, None is returned.
import re
match = re.search(r'batman', 'the batman is cool')
print(match.group())
# batman
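Because re.match and re.search return None when nothing is found, calling group() directly on the result can raise an AttributeError. One way to guard against that is sketched below; first_match is a hypothetical helper written for this example, not part of the re module.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: return the first matched text, or None."""
    found = re.search(pattern, text)
    return found.group() if found else None

text = "the batman is cool"

print(first_match(r"batman", text))  # batman
print(first_match(r"joker", text))   # None

# re.match only looks at the start of the string, so it finds nothing here.
anchored = re.match(r"batman", text)
print(anchored)  # None
```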
Original text: dev.to/amitness/a-…