By Amit Chaudhary

In NLP, examining text against patterns or extracting content from text that matches a particular pattern is a common task. Regular expressions are a powerful tool for doing this.

Natural Language Processing (NLP) is a sub-field of artificial intelligence (AI).

Despite their power, regular expressions are often intimidating because there are many special symbols to remember and complex patterns take careful reasoning to read.

In this article, we illustrate the concepts of regular expressions mainly through diagrams. The goal is to help you, and me as well, build a mental model of regular expressions.

Mental models

Let’s start with a simple example where we try to find the word cool in the text.

With a regular expression, we just type the word ‘cool’ as the pattern and it matches that word.

'cool'

The regular expression matches the word ‘cool’ as we expect, but it operates at the character level, not the word level. This is the key point to keep in mind.

Note: Regular expressions work at the character level, not the word level.

This means that the regular expression ‘cool’ will also match inside longer words that contain the same run of characters, such as ‘cooler’.
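A minimal sketch of this character-level behavior, using Python's re module (covered later in this article). The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "A cool day feels cooler with supercool friends"

# 'cool' is matched as a run of four characters, so it is also
# found inside the longer words "cooler" and "supercool".
matches = re.findall(r'cool', text)
print(matches)  # ['cool', 'cool', 'cool']
```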

Basic building blocks

Now that we understand the key points, let’s look at how to use regular expressions to match simple characters.

Specific character

We can specify characters in a regular expression that will match all instances in the text.

For example, the following regular expression will match all instances of ‘a’ in the text:

'a'

You can also use any digit from 0 to 9 as a pattern to match that digit.

'3'

Note that by default, regular expressions are case sensitive, so the following pattern matches only an uppercase ‘A’ and finds nothing in text that contains only lowercase ‘a’.

'A'
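The three patterns above can be tried out with a short sketch like the following. The sample text is an assumption made up for this example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Bananas cost 3 dollars, apples cost 13"

# A literal character matches every occurrence of that character.
a_matches = re.findall(r'a', text)
print(len(a_matches))  # 5

# A digit works the same way: it matches the 3 on its own and the 3 in 13.
three_matches = re.findall(r'3', text)
print(three_matches)  # ['3', '3']

# Matching is case sensitive, and the text has no uppercase 'A'.
upper_a_matches = re.findall(r'A', text)
print(upper_a_matches)  # []
```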

Whitespace characters

We can use special escape sequences to detect whitespace characters: a plain space, a tab (\t), and a newline (\n).

In addition to these common ones, we have:

  • \r: carriage return
  • \f: form feed
  • \e: escape (supported by some regex flavors, but not by Python's re module)
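As a rough sketch, the whitespace escape sequences can be checked in Python like this. The sample strings are assumptions for illustration.

```python
import re

# Hypothetical text containing one space, one newline, and one tab.
text = "first line\nsecond\tline"

newlines = re.findall(r'\n', text)
tabs = re.findall(r'\t', text)
spaces = re.findall(r' ', text)
print(len(newlines), len(tabs), len(spaces))  # 1 1 1

# A carriage return, as found in Windows-style line endings.
carriage_returns = re.findall(r'\r', "a\r\nb")
print(len(carriage_returns))  # 1
```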

Special characters

Regular expressions provide built-in special sequences, each starting with a backslash \, that match a whole class of characters at once.

Pattern: \d

It matches any digit from 0 to 9.

Notice that each match is a single digit. So for the text 18.04 we do not get one match for the whole number; we get four separate matches: 1, 8, 0, and 4.

Pattern: \s

It matches any whitespace character (space, tab, or newline).

Pattern: \w

It matches any lowercase letter (a to z), uppercase letter (A to Z), digit (0 to 9), or underscore.

Pattern: .

It matches any character except newline (\n).

let str = 'line 1\nline2'
str.match(/./g)
// result: ["l", "i", "n", "e", " ", "1", "l", "i", "n", "e", "2"]

Pattern: Negation

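If we use the uppercase form of any of the patterns above, it matches the opposite set of characters.

For example, since \d matches any digit from 0 to 9, \D matches any character that is not a digit. Likewise, \S matches any non-whitespace character and \W matches any character that \w does not.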
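The special sequences and their negations can be compared side by side with a sketch like this one. The sample text is an assumption for illustration.

```python
import re

# Hypothetical text: a letter, a space, a version number, a newline,
# an underscore, and another letter.
text = "v 18.04\n_x"

digits = re.findall(r'\d', text)
print(digits)  # ['1', '8', '0', '4']

whitespace = re.findall(r'\s', text)
print(whitespace)  # [' ', '\n']

word_chars = re.findall(r'\w', text)
print(word_chars)  # ['v', '1', '8', '0', '4', '_', 'x']

# The dot matches everything except the newline.
any_chars = re.findall(r'.', text)
print(len(any_chars))  # 9

# Uppercase forms match the complementary sets.
non_digits = re.findall(r'\D', text)
print(non_digits)  # ['v', ' ', '.', '\n', '_', 'x']

non_word = re.findall(r'\W', text)
print(non_word)  # [' ', '.', '\n']
```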

Character set

Character set patterns start with [ and end with ], and match any one of the characters enclosed in the brackets. For example, the pattern [aeiou] matches any of the characters ‘a’, ‘e’, ‘i’, ‘o’, and ‘u’.

We can also use the pattern [0123456789] to replicate the functionality of \d.

Instead of listing every digit, we can use a hyphen to specify a range by giving only the start and end digits. Therefore, we can use [0-9] instead of [0123456789].

For example, [2-4] can be used to match any digit between 2 and 4 (that is, 2 or 3 or 4).

We can also use the special sequences described above inside the brackets. For example, [\d\s] matches any digit from 0 to 9 or any whitespace character.

Below, some common patterns and their meanings are listed.
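A few commonly used character sets can be tried out as follows. This is a sketch with sets chosen for illustration, not a reproduction of the original list, and the sample text is made up.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Order 66 at 9am"

vowels = re.findall(r'[aeiou]', text)       # lowercase vowels only
print(vowels)  # ['e', 'a', 'a']

digits = re.findall(r'[0-9]', text)         # same as \d here
print(digits)  # ['6', '6', '9']

uppercase = re.findall(r'[A-Z]', text)      # uppercase letters
print(uppercase)  # ['O']

letters = re.findall(r'[a-zA-Z]', text)     # any letter
print(len(letters))  # 9

two_to_four = re.findall(r'[2-4]', "12345") # a narrower digit range
print(two_to_four)  # ['2', '3', '4']

digit_or_space = re.findall(r'[\d\s]', text)
print(digit_or_space)  # [' ', '6', '6', ' ', ' ', '9']
```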

Anchors

Regular expressions also have special anchors that make a pattern match only at the beginning or end of a string.

We can use the ^ character before a pattern to match it only at the start of the text. For example, ^a matches an ‘a’ only when it is the first character.

Similarly, we can use the $ character after a pattern to match it only at the end of the text. For example, a$ matches an ‘a’ only when it is the last character.
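The effect of the two anchors can be sketched in Python as follows. The sample text is an assumption for illustration.

```python
import re

# Hypothetical text that both starts and ends with the word "the".
text = "the end of the"

# Without anchors, both occurrences are found.
unanchored = re.findall(r'the', text)
print(unanchored)  # ['the', 'the']

# ^ keeps only a match at the very start of the text.
at_start = re.findall(r'^the', text)
print(at_start)  # ['the']

# $ keeps only a match at the very end of the text.
at_end = re.findall(r'the$', text)
print(at_end)  # ['the']

# "end" appears in the text, but not at the start.
end_at_start = re.findall(r'^end', text)
print(end_at_start)  # []
```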

Escape metacharacters

Consider a case where we want to match the word “Mr. Stark” exactly.

If we use Mr. Stark as the pattern as it stands, it will have an unexpected effect, because the dot has a special meaning in regular expressions: it matches any character, so the pattern would also match text such as “Mrx Stark”.

Therefore, if we want to match a metacharacter such as . or $ literally, we need to escape it with a backslash, as in Mr\. Stark.

Below is a list of metacharacters. Remember to escape them if you want to match them literally.

^ $ . * + ? { } [ ] \ | ( )
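A small sketch of the difference escaping makes. The second sample string is made up for illustration, and re.escape is shown as one optional way to escape a literal string automatically.

```python
import re

# "Mrx Stark" is a hypothetical string used to expose the unescaped dot.
texts = ["Mr. Stark", "Mrx Stark"]

# Unescaped: the dot matches any character, so both strings match.
unescaped = [t for t in texts if re.search(r'Mr. Stark', t)]
print(unescaped)  # ['Mr. Stark', 'Mrx Stark']

# Escaped: the dot must be a literal dot, so only the first string matches.
escaped = [t for t in texts if re.search(r'Mr\. Stark', t)]
print(escaped)  # ['Mr. Stark']

# re.escape builds an escaped pattern from a literal string.
auto_pattern = re.escape("Mr. Stark")
auto_escaped = [t for t in texts if re.search(auto_pattern, t)]
print(auto_escaped)  # ['Mr. Stark']
```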

Repetition

Now that we can pattern match any character, let’s move on to a slightly more complex pattern.

The naive way to match repeated characters

Using only what has been learned so far, the naive approach is to repeat the pattern. For example, we can match two digits by repeating the character-level pattern.

\d\d

Quantifiers

Regular expressions provide special quantifiers that specify how many times the character or pattern preceding them should repeat.

Fixed repetition

We can use the {…} quantifier to specify how many times the pattern should be repeated.

For example, the pattern previously used to match two digits can be rewritten as \d{2}.

We can also specify a repetition range using the same quantifier. For example, to match between two and four digits, we can use the pattern \d{2,4}.

When applied to a sentence, it matches runs of two, three, or four digits.

Note that there should not be any spaces between the minimum and maximum counts. For example, \d{2, 4} does not work.
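The fixed-repetition quantifier can be compared with the naive approach in a sketch like this. The sample text is an assumption for illustration.

```python
import re

# Hypothetical sample text with digit runs of length 4, 2, and 1.
text = "Born in 1990, aged 30, id 7"

# Naive repetition and {2} behave the same way.
naive = re.findall(r'\d\d', text)
exactly_two = re.findall(r'\d{2}', text)
print(naive)        # ['19', '90', '30']
print(exactly_two)  # ['19', '90', '30']

# A range takes as many digits as it can, up to four.
two_to_four = re.findall(r'\d{2,4}', text)
print(two_to_four)  # ['1990', '30']

# With a space inside the braces, Python's re no longer treats it as
# a quantifier, so nothing in this text matches.
with_space = re.findall(r'\d{2, 4}', text)
print(with_space)  # []
```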

Flexible quantifiers

Regular expressions also provide the quantifiers *, +, and ? to specify flexible repetition of the preceding character.

The ? character indicates zero or one match

For example, suppose we want to match the words “sound” and “sounds”, where the “s” is optional. We can use the ? quantifier for this: sounds?

The + character indicates one or more matches

For example, we can use the regular expression \d+ to find numbers of arbitrary length.

The * character indicates zero or more matches
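The three flexible quantifiers can be sketched as follows. The sample strings, and the ab* pattern used for the * case, are assumptions for illustration.

```python
import re

# ? makes the preceding character optional.
optional_s = re.findall(r'sounds?', "One sound, many sounds")
print(optional_s)  # ['sound', 'sounds']

# + requires at least one occurrence and takes as many as it can.
numbers = re.findall(r'\d+', "Room 7, floor 12, code 9000")
print(numbers)  # ['7', '12', '9000']

# * allows zero or more occurrences: an 'a' followed by any number of 'b's.
a_then_bs = re.findall(r'ab*', "a ab abbb")
print(a_then_bs)  # ['a', 'ab', 'abbb']
```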

Usage in Python

Python provides a module in the standard library called “re” to use regular expressions.

The need for raw strings

To specify regular expressions in Python, we use a raw string, created by putting an r in front of the string literal.

pattern = r'\d'

To understand why we add r in front, let’s try printing the expression \t without the r prefix.

>>> pattern = '\t'
>>> print(pattern)



As you can see, when we are not using a raw string, Python interprets \t as a tab character, so only invisible whitespace is printed.

Now we convert it to a raw string, and we get exactly what we typed.

>>> pattern = r'\t'
>>> print(pattern)
\t
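Another way to see the difference is to compare the lengths of the two strings. This is a small sketch; the variable names are chosen for illustration.

```python
# Without the r prefix, '\t' is a single tab character.
plain = '\t'
print(len(plain))  # 1

# With the r prefix, the backslash and the 't' are kept as two characters.
raw = r'\t'
print(len(raw))  # 2

# The raw string is the same as escaping the backslash by hand.
print(raw == '\\t')  # True
```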

Using the re module

To use the re module, we need to import it:

import re

1. re.findall

This function allows us to get all matches as a list of strings.

import re
print(re.findall(r'\d', '123456'))

# ['1', '2', '3', '4', '5', '6']

2. re.match

This function searches for patterns at the beginning of the string and returns the first match as a match object. If the pattern is not found, None is returned.

import re

match = re.match(r'batman', 'batman is cool')
print(match)

# <re.Match object; span=(0, 6), match='batman'>

With the match object, we can get the matched text by calling group():

print(match.group())

# batman

In cases where our pattern is not at the beginning of the sentence, we will not get any matches.

import re

match = re.match(r'batman', 'The batman is cool')
print(match)

# None

3. re.search

This function can also look for the first occurrence of a pattern, but the pattern can appear anywhere in the text. If the pattern is not found, None is returned.

import re

match = re.search(r'batman', 'the batman is cool')
print(match.group())

# batman
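Because both re.match and re.search return None when nothing is found, calling group() on the result directly raises an AttributeError in that case. One way to guard against this is sketched below; first_match is a hypothetical helper written for this example, not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the first matched text anywhere in `text`, or None.

    Hypothetical helper for illustration; it only wraps re.search.
    """
    match = re.search(pattern, text)
    return match.group() if match else None

found = first_match(r'batman', 'the batman is cool')
print(found)  # batman

missing = first_match(r'superman', 'the batman is cool')
print(missing)  # None

# re.match only looks at the start of the text, so it finds nothing here.
at_start = re.match(r'batman', 'the batman is cool')
print(at_start)  # None
```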

That’s it for today. See you next time.



Original text: dev.to/amitness/a-…
