Table of contents
- Regular expressions
  - Concept
  - Composition
- Applying the re module
- Common regular expressions
  - Numbers
  - Characters
  - Other
- Summary
Regular expressions
Concept
Regular expressions are a computer science concept commonly used to search for and replace text that conforms to certain rules. A regular expression is a logical formula for operating on strings: it uses a predefined pattern string to filter other strings.
Regular expressions are essentially a small, highly specialized programming language. In Python they are implemented by the re module. A regular expression specifies rules for the set of strings to be matched, and the re module can then modify or split strings according to those rules.
A regular expression pattern is compiled into a series of bytecodes and executed by a matching engine written in C, so it can be somewhat faster than equivalent string-processing code written directly in Python. Not every string-matching task can be done with regular expressions, though, and even when a single expression can do the job it may be complicated and hard to read; in those cases it is better to write Python code directly.
Composition
A regular expression consists of two types of characters: metacharacters, which have a special meaning in the regular expression, and ordinary characters.
Characters and syntax:
Syntax | Description | Example expression | Matches |
---|---|---|---|
Ordinary character | Matches itself | abc | abc |
. | Matches any character except the newline character \n | a.c | abc |
\ | Escape character | a\.c | a.c |
[abcd] | Matches a, b, c, or d | [abc] | a |
[0-9] | Matches any digit from 0 to 9, equivalent to [0123456789] | [0-3] | 1 |
[\u4e00-\u9fa5] | Matches any Chinese character | [\u4e00-\u9fa5] | 汉 |
[^a0=] | Matches any character except a, 0, and = | [^abc] | d |
[^a-z] | Matches any character except lowercase letters | [^a-z] | A |
\d | Matches any digit, equivalent to [0-9] | a\dc | a6c |
\D | Matches any non-digit character, equivalent to [^0-9] | a\Dc | abc |
\s | Matches any whitespace character, equivalent to [ \t\n\r\f\v] | a\sc | a c |
\S | Matches any non-whitespace character, equivalent to [^ \t\n\r\f\v] | a\Sc | aYc |
\w | Matches any letter, digit, or underscore, equivalent to [A-Za-z0-9_] | a\wc | a_c |
\W | Matches any character that is not a letter, digit, or underscore, equivalent to [^A-Za-z0-9_] | a\Wc | a*c |
* | Matches the preceding character 0 or more times | a*c | c |
+ | Matches the preceding character 1 or more times | a+c | aaaac |
? | Matches the preceding character 0 or 1 times | a?c | ac |
{m} | Matches the preceding character exactly m times | a{3}c | aaac |
{m,n} | Matches the preceding character m to n times. Either m or n may be omitted; the defaults are 0 and infinity, respectively | a{1,2}c | aac |
^ | Matches the start of the string without consuming any characters | ^abc | abc |
$ | Matches the end of the string without consuming any characters | abc$ | abc |
\| | Alternation: matches either the expression on the left or the one on the right | abc\|def | def |
(...) | Groups the enclosed expression | (abc){2} | abcabc |
(?P<name>...) | Group that gets an alias in addition to its number | (?P<id>abc){2} | abcabc |
\<number> | Backreference: matches the same text that the group with that number matched | (\d)abc\1 | 1abc1 |
(?:...) | Non-capturing version of (...); can be followed by a quantifier | (?:abc){2} | abcabc |
(?iLmsux) | Each letter in iLmsux selects a matching mode; may only appear at the start of the pattern | (?i)abc | AbC |
(?#...) | Comment: the text after # is ignored | a(?#test)bc | abc |
(?(id/name)yes-pattern\|no-pattern) | If the group with the given id or alias name matched, yes-pattern must match; otherwise no-pattern must match, similar to a ternary operator | (\d)abc(?(1)\d\|abc) | 1abc2 |
Each rule above describes a single kind of match; in practice a pattern usually combines several of them, so they are worth learning well. Memorizing the rules directly is tedious, so the following sections explain them through Python's re module. Note that the equivalences listed for \d, \s, and \w hold for ASCII text; with str patterns, Python 3 also matches Unicode digits, whitespace, and word characters unless the re.ASCII flag is set.
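As a quick check of the table, the sketch below pairs several of its example expressions with the strings they are meant to match and tests each pair with re.fullmatch(). The `samples` list and the choice of fullmatch() are illustrative assumptions, not part of the original article.

```python
import re

# Each entry pairs a pattern from the syntax table with a sample string.
samples = [
    (r"a.c", "abc"),
    (r"a\.c", "a.c"),
    (r"[0-3]", "1"),
    (r"a\dc", "a6c"),
    (r"a\sc", "a c"),
    (r"a{1,2}c", "aac"),
    (r"(abc){2}", "abcabc"),
    (r"(?P<id>abc){2}", "abcabc"),
    (r"(\d)abc\1", "1abc1"),
    (r"(?:abc){2}", "abcabc"),
    (r"(?i)abc", "AbC"),
    (r"a(?#test)bc", "abc"),
    (r"(\d)abc(?(1)\d|abc)", "1abc2"),
]

# fullmatch() succeeds only if the whole sample string matches the pattern.
results = {pattern: bool(re.fullmatch(pattern, text)) for pattern, text in samples}

for pattern, ok in results.items():
    print(f"{pattern!r:28} {ok}")
```

Every pair should print True; changing a sample string (for example "1abc2" to "1abcx") is an easy way to see a rule fail.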
Applying the re module
The re module has been part of Python since version 1.5 and provides Perl-style regular expression patterns. It ships with Python, so it can be imported directly. Its __version__ attribute shows the version, and its __all__ attribute lists the public names:
```python
import re

print(re.__version__)  # version string of the re module
print(re.__all__)      # public names exported by the module
```
The output is as follows:
```
2.2.1
['match', 'fullmatch', 'search', 'sub', 'subn', 'split', 'findall', 'finditer', 'compile', 'purge', 'template', 'escape', 'error', 'Pattern', 'Match', 'A', 'I', 'L', 'M', 'S', 'X', 'U', 'ASCII', 'IGNORECASE', 'LOCALE', 'MULTILINE', 'DOTALL', 'VERBOSE', 'UNICODE']
```
The output shows that the re module exposes only a small number of functions. They fall into three groups: finding patterns in text, compiling expressions, and matching multiple times. The module also defines several constants.
The search() function is mainly used to find a pattern in text. It takes three parameters: pattern, string, and flags.
- pattern is the regular expression string to be compiled
- string is the string to be searched
- flags is the compilation flag, which changes how the regular expression matches, for example case sensitivity or multi-line matching. The default value is 0. Common flag values are as follows:
Flag | Meaning |
---|---|
re.S (DOTALL) | Makes "." match every character, including newlines |
re.I (IGNORECASE) | Makes matching case-insensitive |
re.L (LOCALE) | Makes matching locale-aware |
re.M (MULTILINE) | Multi-line matching; affects ^ and $ |
re.X (VERBOSE) | Allows a more flexible pattern layout so that regular expressions are easier to read |
re.U (UNICODE) | Interprets characters according to the Unicode character set; affects \w, \W, \b, \B |
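The flags in the table can be tried out with a short sketch like the one below. The sample text and variable names are made up for illustration; flags are combined with the | operator.

```python
import re

text = "Hello\nworld"

# Without DOTALL, "." stops at the newline; with re.S it crosses it.
no_dotall = re.search(r"Hello.world", text)
with_dotall = re.search(r"Hello.world", text, re.S)

# re.I ignores case; re.M lets ^ match at the start of every line.
ignore_case = re.search(r"hello", text, re.I)
line_starts = re.findall(r"^\w+", text, re.M)

# re.X allows whitespace and comments inside the pattern.
verbose = re.search(
    r"""
    h\w+    # first word
    \s      # the line break
    w\w+    # second word
    """,
    text,
    re.X | re.I,
)

print(no_dotall)      # None
print(line_starts)    # ['Hello', 'world']
print(repr(verbose.group()))
```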
re.search()
The function takes a pattern and the text to scan as input and returns a Match object, or None if the pattern is not found.
The returned Match object contains information about the match, such as the position in the original string at which the pattern occurs. It provides the methods start(), end(), group(), span(), and groups():
- start() returns the start position of the match
- end() returns the end position of the match
- group() returns the matched string
- span() returns a tuple containing the (start, end) positions of the match
- groups() returns a tuple containing the strings of all subgroups in the regular expression, from group 1 up to the last group; it is usually called without arguments
In addition, group(n, m) returns the strings matched by group numbers n and m.
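A minimal sketch of search() and the Match methods described above, using a made-up sample string with a date in it:

```python
import re

text = "Order 66 shipped on 2021-03-15"

# search() scans the whole string and returns the first match.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

print(m.group())                     # 2021-03-15
print(m.start(), m.end(), m.span())  # 20 30 (20, 30)
print(m.groups())                    # ('2021', '03', '15')
print(m.group(1, 3))                 # ('2021', '15')

# search() returns None when nothing matches, so check before using the result.
missing = re.search(r"[a-z]+@", text)
print(missing)                       # None
```

Because a failed search returns None, calling group() on the result without a check raises AttributeError; testing `if m:` first avoids that.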
compile() precompilation
The compile() function compiles a regular expression into a regular expression object to improve execution efficiency. It returns a pattern object and takes two parameters, pattern and flags=0, which have the same meaning as described for search() above.
Precompiling pays off for expressions that a program uses frequently, although keeping the compiled objects around has some caching cost. Another advantage is that all expressions are compiled once when the module is loaded, rather than while the program is responding to a user action.
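A small sketch of precompilation: the pattern is compiled once at module level and the resulting object is reused. The name `WORD_RE` and the sample lines are illustrative only.

```python
import re

# Compiled once when the module is loaded; WORD_RE is an illustrative name.
WORD_RE = re.compile(r"[a-z]+", re.I)

lines = ["Regular expressions", "in Python 3"]

# The compiled object offers the same methods as the module-level functions.
words = [WORD_RE.findall(line) for line in lines]

print(words)            # [['Regular', 'expressions'], ['in', 'Python']]
print(WORD_RE.pattern)  # [a-z]+
print(bool(WORD_RE.flags & re.I))
```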
match() start-position matching
The match() function matches at the beginning of the text string. It does not require the whole string to match; it only requires the pattern to match at the start of the string.
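The difference between match(), search(), and fullmatch() can be seen side by side in this sketch (the sample string is made up):

```python
import re

text = "python3 regex"

at_start = re.match(r"python", text)     # pattern found at position 0
not_at_start = re.match(r"regex", text)  # "regex" occurs, but not at the start
anywhere = re.search(r"regex", text)     # search() scans the whole string
whole = re.fullmatch(r"python", text)    # fullmatch() needs the entire string to match

print(at_start.group(), not_at_start, anywhere.start(), whole)
```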
findall() returns a list of all matched substrings in a string. It is used in the same way as search(), but it returns every non-overlapping match rather than only the first one.
The finditer() function is used in the same way as findall(), except that it returns an iterator instead of a list. The iterator yields Match instances.
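A brief sketch of both functions on a made-up string. It also shows one detail worth knowing: when the pattern contains groups, findall() returns the groups (as tuples) instead of the whole match.

```python
import re

text = "a1 b22 c333"

numbers = re.findall(r"\d+", text)
pairs = re.findall(r"([a-z])(\d+)", text)  # with groups, findall() returns tuples

# finditer() yields Match objects, so positions are available as well.
positions = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

print(numbers)    # ['1', '22', '333']
print(pairs)      # [('a', '1'), ('b', '22'), ('c', '333')]
print(positions)  # [('1', (1, 2)), ('22', (4, 6)), ('333', (8, 11))]
```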
Blogger's CSDN address: wzlodq.blog.csdn.net/
split() splitting
The split() function splits the string at each substring that matches the pattern and returns a list.
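A short sketch of split() with made-up input. The maxsplit argument limits the number of splits, and a capturing group in the pattern keeps the separators in the result.

```python
import re

text = "apple, banana;cherry  date"

parts = re.split(r"[,;\s]+", text)
limited = re.split(r"[,;\s]+", text, maxsplit=1)

# A capturing group around the separator keeps it in the returned list.
with_seps = re.split(r"([,;])\s*", "a, b;c")

print(parts)      # ['apple', 'banana', 'cherry', 'date']
print(limited)    # ['apple', 'banana;cherry  date']
print(with_seps)  # ['a', ',', 'b', ';', 'c']
```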
sub() and subn()
The sub() function replaces each substring of the string that matches pattern with repl and returns the resulting string. Its form is re.sub(pattern, repl, string, count, flags). The subn() function does the same and additionally returns the number of substitutions made.
- pattern is the expression string
- repl is the replacement text
- string is the string to be searched
- count is the maximum number of substitutions
- flags has the same meaning as above
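A minimal sketch of sub() and subn() on a made-up string. It assumes nothing beyond the standard re API: repl may contain backreferences such as \1, or it may be a function that receives each Match object.

```python
import re

text = "2021-03-15 and 2022-11-02"

# count limits the number of replacements.
masked = re.sub(r"\d", "#", text, count=4)

# Backreferences in repl reorder the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# repl can also be a function; subn() returns (new_string, number_of_substitutions).
result, n = re.subn(r"\d{4}", lambda m: str(int(m.group()) + 1), text)

print(masked)     # ####-03-15 and 2022-11-02
print(swapped)    # 15/03/2021 and 02/11/2022
print(result, n)  # 2022-03-15 and 2023-11-02 2
```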
Common regular expressions
Numbers
Numeric expressions are mainly used to match and validate numbers in text. The following explains some common expressions, which can be applied with the re module.
^[0-9]*$
As mentioned earlier, ^ matches the start of the string; $ matches the end of the string; [0-9] represents any digit; * matches the preceding character 0 or more times. These elements are not explained again below; if anything is unclear, refer to the table above.
In summary, this expression matches a string of digits.
^[1-9]\d*$
Matches positive integers that do not start with 0
^\d{n}$
Matches n digits
^\d{n,}$
At least n digits are matched
^\d{m,n}$
Matches numbers of m to n digits
^([1-9][0-9]*)(\.[0-9]{1,2})?$
Matches numbers that do not start with 0 and have at most two decimal places
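The numeric patterns can be exercised with a sketch like this one. The concrete values chosen for n and m (4, and 3 to 5), the test strings, and the `checks` dictionary are illustrative; the last pattern assumes the decimal form `^[1-9][0-9]*(\.[0-9]{1,2})?$`, inferred from the description "at most two decimal places".

```python
import re

# name -> (pattern, sample values); all values are made-up test inputs.
checks = {
    "digits only": (r"^[0-9]*$", ["2021", "", "20a1"]),
    "no leading zero": (r"^[1-9]\d*$", ["42", "042"]),
    "exactly 4 digits": (r"^\d{4}$", ["1234", "12345"]),
    "3 to 5 digits": (r"^\d{3,5}$", ["12", "123", "123456"]),
    # Assumed reconstruction of the two-decimal pattern.
    "up to two decimals": (r"^[1-9][0-9]*(\.[0-9]{1,2})?$", ["12.5", "12.345", "0.5"]),
}

results = {
    name: [bool(re.search(pattern, value)) for value in values]
    for name, (pattern, values) in checks.items()
}

for name, outcome in results.items():
    print(f"{name:20} {outcome}")
```

Note that `^[0-9]*$` also accepts the empty string, because * allows zero repetitions; use + instead if at least one digit is required.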
Characters
Text analysis often involves character expressions, for example to extract Chinese characters or to remove strings of a particular length.
[\u4e00-\u9fa5]
Matches Chinese characters
^[A-Za-z0-9]+$
Matches English letters and digits
^[\u4E00-\u9FA5A-Za-z0-9]+$
Matches Chinese characters, English letters, and digits
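A short sketch of the character patterns, for example extracting the Chinese characters from mixed text. The sample strings are made up, and the combined pattern is written with the range A-Za-z0-9.

```python
import re

text = "Python正则表达式re模块2021"

# Extract runs of Chinese characters from mixed text.
chinese = re.findall(r"[\u4e00-\u9fa5]+", text)

# Validate whole strings against the letter/digit patterns.
is_alnum = bool(re.search(r"^[A-Za-z0-9]+$", "Python3"))
mixed_ok = bool(re.search(r"^[\u4E00-\u9FA5A-Za-z0-9]+$", text))
mixed_bad = bool(re.search(r"^[\u4E00-\u9FA5A-Za-z0-9]+$", "re 模块"))  # contains a space

print(chinese)   # ['正则表达式', '模块']
print(is_alnum, mixed_ok, mixed_bad)
```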
Other
^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
Validates an e-mail address
^1[34589]\d{9}$
Matches a mobile phone number
^[1-9]\d{5}(18|19|([23]\d))\d{2}((0[1-9])|(10|11|12))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]$
Matches an ID card number
^[1-9]\d{5}(?!\d)$
Matches a zip code
(?i)^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$
Matches a domain name
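The validation patterns can be wrapped in a small helper, as in the sketch below. The constant names, the `matches` helper, and all sample values are hypothetical; the patterns are written without the stray spaces and doubled backslashes, and the e-mail pattern assumes the `([-+.]\w+)*` form for the local part.

```python
import re

# Illustrative names for patterns from this section.
EMAIL = r"^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$"
MOBILE = r"^1[34589]\d{9}$"
ZIP_CODE = r"^[1-9]\d{5}(?!\d)$"
DOMAIN = r"(?i)^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$"


def matches(pattern, value):
    """Return True if value matches pattern (hypothetical helper)."""
    return re.search(pattern, value) is not None


# Made-up sample inputs, not real contact data.
print(matches(EMAIL, "user.name@example.com"), matches(EMAIL, "user@@example.com"))
print(matches(MOBILE, "13812345678"), matches(MOBILE, "12812345678"))
print(matches(ZIP_CODE, "100000"), matches(ZIP_CODE, "012345"))
print(matches(DOMAIN, "Blog.CSDN.net"), matches(DOMAIN, "-bad.com"))
```

Patterns like these check the shape of the input only; they do not confirm that an address, number, or domain actually exists.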
Summary
The most important use of regular expressions and the re module is filtering: extracting the required data from a target string. By combining the module's functions, data with any given characteristics can be extracted from a string, which is the basis for parsing data in Python web crawlers later on.
The Python blog series will continue to be updated.
Original content; please do not reprint. Blogger homepage: wzlodq.blog.csdn.net/ WeChat official account: weow lo dong qiang