I recently needed crawler data for testing, so I started working through the gaps in my crawler knowledge. The first step was to learn regular expressions systematically, so I've organized the relevant knowledge points below. The language environment is Python, and I'll focus on Python's re module.

I've listed the main parts of the syntax below; for the rest, refer directly to the Python documentation: docs.python.org/3/library/r…

I. Summary of basic syntax

1.1. Match a single character

a . \d \D \w \W \s \S […] [^…].

Matches a single character (.)

Rule: matches any character except a newline.

In [24]: re.findall("f.o", "foo is not fao")
Out[24]: ['foo', 'fao']

Matches any (non-)numeric character (\d \D)

\d   [0-9]
\D   [^0-9]

Matches any (non-)word character (\w \W)

\w word characters: [_0-9a-zA-Z], plus Chinese (and other Unicode word) characters
\W non-word characters

Matches any (non-)whitespace character (\s \S)

\s matches any whitespace character, e.g. [ \r\n\t]
\S matches any non-whitespace character
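To make these character classes concrete, here is a minimal sketch (the sample strings are made up for illustration):

import re

re.findall(r"\d", "a1b22c")    # ['1', '2', '2']
re.findall(r"\D", "a1b2")      # ['a', 'b']
re.findall(r"\w", "hi! ok")    # ['h', 'i', 'o', 'k']
re.findall(r"\s", "a b\tc")    # [' ', '\t']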

Matching a character set ([…])

[A-Z][a-z][0-9][_123a-z]

Matching a negated character set ([^…])

[^abc] --> any character except a, b, and c
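A quick sketch of character sets in action (sample strings invented for illustration):

import re

re.findall(r"[A-Z]", "Hello World")   # ['H', 'W']
re.findall(r"[_0-9a-f]", "3Fz_9")     # ['3', '_', '9']
re.findall(r"[^abc]", "abcde")        # ['d', 'e']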

1.2. Match multiple characters

*      matches 0 or more times
+      matches 1 or more times
?      matches 0 or 1 times
{m}    matches exactly m times
{m,n}  matches between m and n times
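For example (a small sketch; the strings are invented):

import re

re.findall(r"ab*", "a ab abb")        # ['a', 'ab', 'abb']
re.findall(r"ab+", "a ab abb")        # ['ab', 'abb']
re.findall(r"ab?", "a ab abb")        # ['a', 'ab', 'ab']
re.findall(r"ab{2}", "ab abb abbb")   # ['abb', 'abb']
re.findall(r"ab{1,2}", "ab abb abbb") # ['ab', 'abb', 'abb']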

1.3. Matching position

^   matches at the start position
$   matches at the end position
\A  matches only at the start of the string
\Z  matches only at the end of the string
\b  matches at a word boundary (used in the exercises below to match whole capitalized words)
\B  matches at a non-word boundary
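A small illustration of the position matchers (sample strings invented):

import re

re.search(r"^hello", "hello world")       # matches 'hello' at the start
re.search(r"world$", "hello world")       # matches 'world' at the end
re.findall(r"\bis\b", "this is history")  # ['is'] -- only the standalone word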

1.4. Escaping

Special characters need to be escaped in a regular expression. Just add a \ in front of the special character to escape it.

. * + ? | \ ^ $ [ ] { } ( )
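For instance, to match a literal dot or plus sign (a minimal sketch with invented strings):

import re

re.findall(r"3\.14", "3.14 or 3x14")   # ['3.14'] -- the escaped dot only matches '.'
re.findall(r"\d\+\d", "1+2=3")         # ['1+2'] -- the escaped + is a literal plus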

1.5. Subgroups

You can use () to create groups inside a regular expression. A subgroup is part of the expression that is treated as a single unit.

In [61]: re.search(r"(https|http|ftp):\/\/\w+\.\w+\.(com|cn)", "https://www.baidu.com").group(0)
Out[61]: 'https://www.baidu.com'

In [62]: re.search(r"(https|http|ftp):\/\/\w+\.\w+\.(com|cn)", "https://www.baidu.com").group(1)
Out[62]: 'https'

1.6. Greedy mode and non-greedy mode

By default, repetition in a regular expression matches as much of the target as possible; this is greedy mode. The greedy quantifiers are: * + ? {m,n}

Non-greedy mode matches as little as possible. To switch a greedy quantifier to non-greedy, append a ?: *? +? ?? {m,n}?

In [106]: re.findall(r"ab+?", "abbbbbbbb")
Out[106]: ['ab']

In [107]: re.findall(r"ab??", "abbbbbbbb")
Out[107]: ['a']

II. The re module

The arguments used by the functions below are explained as follows:

pattern:     the regular expression string
string:      the target string
pos:         start position for matching within the target string
endpos:      end position for matching within the target string
flags:       function flags
replaceStr:  the replacement string
max:         the maximum number of replacements (default: replace everywhere)

The re module, regex objects, and match objects in Python are related as follows:

  • 1. The re module's compile() function returns a regex object
  • 2. The finditer(), fullmatch(), match(), and search() methods of the re module and of regex objects return match objects (finditer() returns an iterator of them)
  • 3. Each of them has its own attributes and methods
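A minimal sketch of that relationship (the pattern and string here are just examples):

import re

regex = re.compile(r"\d+")           # re.compile() returns a regex object
m = regex.search("order 66 ready")   # the regex object's search() returns a match object
print(m.group())                     # 66
# the module-level function gives the same result without an explicit compile step
print(re.search(r"\d+", "order 66 ready").group())   # 66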

2.1. compile

regex = re.compile(pattern, flags=0)  # Generate a regular expression (regex) object


2.2. findall

re.findall(pattern, string, flags)  # Return all matches of the pattern in the target string (a compiled regex object's findall() also accepts pos and endpos)
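For example (a sketch with an invented string; pos is passed to the compiled object's findall(), while the module-level function only takes flags):

import re

print(re.findall(r"\d+", "2014 aoyun 08 dizhen 512"))   # ['2014', '08', '512']

regex = re.compile(r"\d+")
print(regex.findall("2014 aoyun 08 dizhen 512", 5))     # ['08', '512'] -- pos=5 skips the leading '2014 '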

2.3. split

re.split(pattern, string, maxsplit, flags)  # Split the target string wherever the regular expression matches

In [79]: re.split(r'\s+', "Hello World")
Out[79]: ['Hello', 'World']

2.4. sub

re.sub(pattern, replaceStr, string, max, flags)  # Replace matches of the pattern in the target string with replaceStr

In [80]: re.sub(r'\s+', "##", "hello world")
Out[80]: 'hello##world'

2.5. subn

re.subn(pattern, replaceStr, string, max, flags)  # Same as sub, but returns a tuple of the replaced string and the number of replacements made

In [80]: re.subn(r'\s+', "##", "hello world")
Out[80]: ('hello##world', 1)

2.6. finditer

re.finditer(pattern, string)  # Return an iterator of match objects; call group() on each match to get its value

In [87]: it = re.finditer(r'\d+', "2014nianshiqiqngduo 08aoyun 512dizhen")

In [88]: for i in it:
    ...:     print(i)
    ...:
<_sre.SRE_Match object at 0x7f0639767920>
<_sre.SRE_Match object at 0x7f0639767ac0>
<_sre.SRE_Match object at 0x7f0639767920>

In [93]: it = re.finditer(r'\d+', "2014nianshiqiqngduo 08aoyun 512dizhen")

In [94]: for i in it:
    ...:     print(i.group())
    ...:
2014
08
512

2.7. fullmatch

re.fullmatch(pattern, string, flags)  # Match the entire target string, roughly equivalent to adding ^ and $ to the pattern
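A quick sketch (strings invented):

import re

print(re.fullmatch(r"\d+", "2014"))      # a match object: the whole string consists of digits
print(re.fullmatch(r"\d+", "2014abc"))   # None: the trailing 'abc' prevents a full match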

2.8. match

re.match(pattern, string, flags)  # Match only at the beginning of the target string
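For example (invented strings):

import re

print(re.match(r"\d+", "2014 dizhen").group())   # 2014 -- digits at the start of the string
print(re.match(r"\d+", "year 2014"))             # None -- the string does not start with digits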

2.9. search

re.search(pattern, string, flags)  # Scan the target string and return only the first match
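Unlike match(), search() finds a match anywhere in the string, but still only the first one (a sketch with an invented string):

import re

print(re.search(r"\d+", "year 2014, month 08").group())   # 2014 -- only the first match is returned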

III. Some exercises

3.1. Match words that start with a capital letter

import re

f = open('test.txt')
# capture a capitalized word, keeping any trailing whitespace
pattern = r'\b[A-Z][a-zA-Z]*\s*'
# pattern = r'\b[A-Z]\S'
L = []

for i in f:
    L += re.findall(pattern, i)
f.close()
print(L)

The contents of the test.txt file are as follows:

Hello World -12.6 Nihao 123 How are you -12 1.24 ASDK 34%, accounted for 1/2 2003-2005./%

3.2. Match numbers (positive, negative, decimal, percentage, fraction)

import re

# an integer part, optionally followed by a fraction (/..), a decimal part (.xx), or a percent sign
pattern = r"-?\d+((/?\d+)|((\.)?\d+)|((\%)?))"
f = open('test.txt')
l = []
for line in f:
    l += re.finditer(pattern, line)
f.close()

for i in l:
    print(i.group())