In this section, we explain zero-width assertions, a common and important feature of regular expressions.

Introductory example

Let’s start with an example. Here’s a question and answer dialogue:

Q: I am running Windows XP+Service Pack 2. Why can’t I install the control for entering the card number and password? A: In Windows XP+Service Pack 2, Windows 2003 and other operating systems, users can choose whether to install controls or not. Q: Why do I see an * sign in the card number input box? A: Your browser forbids downloading and executing ActiveX controls. In this case, you must turn on the browser’s ActiveX permissions. Operation method: from the browser menu select “Tools” | “Internet Options”; in the pop-up dialog, select “Security” | “Internet” | “Custom Level”; in the next pop-up dialog, select “Reset to security level”, click the “Reset” button, and confirm. Q: After reading the above questions, I still cannot log in. What should I do? A: Your browser cannot install the CMB login control for other reasons. Please download and install the downloadable version of the CMB login control. Q: The public login page of personal online banking cannot be displayed. A: This situation is caused by your machine being unable to establish a secure connection with our bank’s server, usually due to a proxy server setup error. If you have dial-up Internet access, do not use a proxy server. If you have installed our SSL secure proxy in the past, please use “Add/Remove Programs” to delete it. If you access the Internet through a proxy, contact your network administrator to set up the proxy server. To set a proxy server in Internet Explorer 5.0, go to Internet Options -> Connections -> LAN Settings -> Use a proxy server -> Advanced. Q: I always make mistakes when I enter my account number and card number. How should I enter them? A: The passbook account number is 10 digits; enter it as shown on the passbook. The password is 6 digits. If the one-card number is 12 digits, you only need to enter the 8-digit card number that follows the area code; you do not need to enter the leading 4-digit area code. The password is 6 digits. If the one-card has a 16-digit card number, enter all 16 digits and the 6-digit password. Q: My passbook does not have a password. How can I check the balance in the popular version of personal online banking? A: A passbook must have a password before it can be queried in the popular version of personal online banking, so please go to the bank of deposit to set a password for your passbook. Note: Online personal banking is the online banking service provided by China Merchants Bank for individual customers. The content on this page is for reference only. For some services, refer to the announcements and specific regulations of local branches.

We need to extract each question-and-answer pair from this dialogue, like this:

Q: I am running Windows XP+Service Pack 2. Why can’t I install the control for entering the card number and password?

A: In Windows XP+Service Pack 2, Windows 2003 and other operating systems, users can choose whether to install controls or not.

Q: Why do I see an * sign in the card number input box?

A: Your browser forbids downloading and executing ActiveX controls. In this case, you must turn on the browser’s ActiveX permissions. Operation method: from the browser menu select “Tools” | “Internet Options”; in the pop-up dialog, select “Security” | “Internet” | “Custom Level”; in the next pop-up dialog, select “Reset to security level”, click the “Reset” button, and confirm.

If we were implementing this in Python, the split() or findall() functions of the re module would naturally come to mind. With split(), we might write something like this:

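import re
# text holds the dialogue above; split it on either marker
results = re.split(r'Q: |A: ', text)
# results[0] is the empty string before the first marker, so skip it
for index, result in enumerate(results[1:]):
    print(('Q' if index % 2 == 0 else 'A') + ': ' + result)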

Here the first argument of split() is the regular expression Q: |A: , so the passage is split wherever Q: or A: appears. re.split() splits a string by a regular expression, which makes it more powerful than the split() method of the string itself. The result is a list with an odd number of elements. If we print results, it looks like this:

['', 'I am running Windows XP+Service Pack 2. Why can’t I install the control for entering the card number and password? ', 'In Windows XP+Service Pack 2, Windows 2003 and other operating systems, users can choose whether to install controls or not. ', 'Why do I see an * sign in the card number input box? ', 'Your browser forbids downloading and executing ActiveX controls. In this case, you must turn on the browser’s ActiveX permissions. Operation method: from the browser menu select “Tools” | “Internet Options”; in the pop-up dialog, select “Security” | “Internet” | “Custom Level”; in the next pop-up dialog, select “Reset to security level”, click the “Reset” button, and confirm. ', 'After reading the above questions, I still cannot log in. What should I do? ', 'Your browser cannot install the CMB login control for other reasons. Please download and install the downloadable version of the CMB login control. ', 'The public login page of personal online banking cannot be displayed. ', 'This situation is caused by your machine being unable to establish a secure connection with our bank’s server, usually due to a proxy server setup error. If you have dial-up Internet access, do not use a proxy server. If you have installed our SSL secure proxy in the past, please use “Add/Remove Programs” to delete it. If you access the Internet through a proxy, contact your network administrator to set up the proxy server. To set a proxy server in Internet Explorer 5.0, go to Internet Options -> Connections -> LAN Settings -> Use a proxy server -> Advanced. ', 'I always make mistakes when I enter my account number and card number. How should I enter them? ', 'The passbook account number is 10 digits; enter it as shown on the passbook. The password is 6 digits. If the one-card number is 12 digits, you only need to enter the 8-digit card number that follows the area code; you do not need to enter the leading 4-digit area code. The password is 6 digits. If the one-card has a 16-digit card number, enter all 16 digits and the 6-digit password. ', 'My passbook does not have a password. How can I check the balance in the popular version of personal online banking? ', 'A passbook must have a password before it can be queried in the popular version of personal online banking, so please go to the bank of deposit to set a password for your passbook. Note: Online personal banking is the online banking service provided by China Merchants Bank for individual customers. The content on this page is for reference only. For some services, refer to the announcements and specific regulations of local branches.']

The first element is an empty string because the text itself begins with the delimiter Q: . Everything to the left of that first delimiter is empty, so the result starts with an empty string and the remaining elements are the actual sentences. We therefore slice the result to drop the first element before printing. The final output is as follows:

Q: I am running Windows XP+Service Pack 2. Why can’t I install the control for entering the card number and password?
A: In Windows XP+Service Pack 2, Windows 2003 and other operating systems, users can choose whether to install controls or not.
Q: Why do I see an * sign in the card number input box?
A: Your browser forbids downloading and executing ActiveX controls. In this case, you must turn on the browser’s ActiveX permissions. Operation method: from the browser menu select “Tools” | “Internet Options”; in the pop-up dialog, select “Security” | “Internet” | “Custom Level”; in the next pop-up dialog, select “Reset to security level”, click the “Reset” button, and confirm.
Q: After reading the above questions, I still cannot log in. What should I do?
A: Your browser cannot install the CMB login control for other reasons. Please download and install the downloadable version of the CMB login control.
Q: The public login page of personal online banking cannot be displayed.
A: This situation is caused by your machine being unable to establish a secure connection with our bank’s server, usually due to a proxy server setup error. If you have dial-up Internet access, do not use a proxy server. If you have installed our SSL secure proxy in the past, please use “Add/Remove Programs” to delete it. If you access the Internet through a proxy, contact your network administrator to set up the proxy server. To set a proxy server in Internet Explorer 5.0, go to Internet Options -> Connections -> LAN Settings -> Use a proxy server -> Advanced.
Q: I always make mistakes when I enter my account number and card number. How should I enter them?
A: The passbook account number is 10 digits; enter it as shown on the passbook. The password is 6 digits. If the one-card number is 12 digits, you only need to enter the 8-digit card number that follows the area code; you do not need to enter the leading 4-digit area code. The password is 6 digits. If the one-card has a 16-digit card number, enter all 16 digits and the 6-digit password.
Q: My passbook does not have a password. How can I check the balance in the popular version of personal online banking?
A: A passbook must have a password before it can be queried in the popular version of personal online banking, so please go to the bank of deposit to set a password for your passbook. Note: Online personal banking is the online banking service provided by China Merchants Bank for individual customers. The content on this page is for reference only. For some services, refer to the announcements and specific regulations of local branches.

This works, and the extraction is correct, but it is not very elegant. Questions and answers come back as separate items rather than as pairs, and the first element returned by split() is not something we want, so we have to slice it off. It does not feel like the right tool.
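To make these two drawbacks concrete, here is a minimal sketch on a shortened, made-up dialogue (the sample string and the variable names are assumptions for illustration, not part of the original text). It shows the empty first element and one possible way to regroup the flat list into pairs.

```python
import re

# Hypothetical two-pair sample in the same "Q: ... A: ..." layout as the dialogue
sample = 'Q: What is it? A: A control. Q: Why the star? A: ActiveX is blocked.'

parts = re.split(r'Q: |A: ', sample)
# The text starts with a delimiter, so the first element is an empty string
print(repr(parts[0]))   # ''
print(len(parts))       # 5: one empty string plus four segments

# Regroup the flat list: questions sit at indexes 1, 3, ... and answers at 2, 4, ...
pairs = [(q.strip(), a.strip()) for q, a in zip(parts[1::2], parts[2::2])]
print(pairs)
```

The regrouping works, but it is extra bookkeeping that a pattern matching whole pairs would make unnecessary.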

So we might turn to the findall() method instead and write:

import re
# Capture the question and the answer lazily; re.S lets . match newlines too
results = re.findall(r'Q: (.*?)A: (.*?)', text, re.S)
for result in results:
    print('Q: ' + result[0], 'A: ' + result[1], sep='\n')

However, this pattern does not say where the answer ends. The final non-greedy group (.*?) therefore matches as little as possible, which is nothing at all, and every answer comes back empty:

Q: I am running Windows XP+Service Pack 2. Why can’t I install the control for entering the card number and password?
A:
Q: Why do I see an * sign in the card number input box?
A:
Q: After reading the above questions, I still cannot log in. What should I do?
A:
Q: The public login page of personal online banking cannot be displayed.
A:
Q: I always make mistakes when I enter my account number and card number. How should I enter them?
A:
Q: My passbook does not have a password. How can I check the balance in the popular version of personal online banking?
A:
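The empty answers are easy to reproduce on a short string. The sketch below uses a made-up sample (an assumption for illustration) to show that a trailing lazy group with nothing after it captures the empty string, while a greedy one goes to the other extreme.

```python
import re

# Hypothetical short sample, not the bank dialogue itself
sample = 'Q: one A: first Q: two A: second'

# The trailing lazy group has nothing after it, so it settles for the empty string
lazy = re.findall(r'Q: (.*?)A: (.*?)', sample)
print(lazy)     # [('one ', ''), ('two ', '')]

# A greedy trailing group swallows everything up to the end of the text
greedy = re.findall(r'Q: (.*?)A: (.*)', sample)
print(greedy)   # [('one ', 'first Q: two A: second')]
```

Neither variant is usable here: the pattern needs an explicit end condition.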

To fix this, the pattern has to state where a match ends. Since each answer is followed by the next Q: , we might rewrite it like this:

import re
# End each match at the next "Q: " marker
results = re.findall(r'Q: (.*?)A: (.*?)Q: ', text, re.S)
for result in results:
    print('Q: ' + result[0], 'A: ' + result[1], sep='\n')

This may seem like a good idea, but it turns out to be this:

Q: I am running Windows XP+Service Pack 2. Why can’t I install the control for entering the card number and password?
A: In Windows XP+Service Pack 2, Windows 2003 and other operating systems, users can choose whether to install controls or not.
Q: After reading the above questions, I still cannot log in. What should I do?
A: Your browser cannot install the CMB login control for other reasons. Please download and install the downloadable version of the CMB login control.
Q: I always make mistakes when I enter my account number and card number. How should I enter them?
A: The passbook account number is 10 digits; enter it as shown on the passbook. The password is 6 digits. If the one-card number is 12 digits, you only need to enter the 8-digit card number that follows the area code; you do not need to enter the leading 4-digit area code. The password is 6 digits. If the one-card has a 16-digit card number, enter all 16 digits and the 6-digit password.

The findall() method finds all non-overlapping matches, scanning the string from left to right with an internal search position. Because the pattern ends with Q: , the first match consumes the Q: that begins the second question-and-answer pair, and the search position moves past it. When scanning resumes, the second pair no longer has its leading Q: , so it cannot match. The next match can only start at the third pair, which in turn consumes the Q: of the fourth. That is why only the first, third and fifth pairs are returned.
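The skipping behaviour can be observed directly with finditer(), which exposes where each match ends. This is a small sketch on a made-up four-pair sample (the string and the names found and rest are assumptions for illustration).

```python
import re

sample = 'Q: one A: first Q: two A: second Q: three A: third Q: four A: fourth'

found = []
for m in re.finditer(r'Q: (.*?)A: (.*?)Q: ', sample):
    # m.end() is where the next search starts: already past the following "Q: "
    rest = sample[m.end():]
    found.append((m.group(1).strip(), m.group(2).strip()))
    print(found[-1], '-> search resumes at:', repr(rest[:12]))

print(found)   # [('one', 'first'), ('three', 'third')]
```

After each match the remaining text starts in the middle of the next pair, so that pair can never be matched.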

So, if we want to use this method to find every complete question-and-answer pair, we need a zero-width assertion.

The solution is as follows:

import re
# (?=Q: |\Z) requires the next "Q: " or the end of the text, without consuming it
results = re.findall(r'Q: (.*?)A: (.*?)(?=Q: |\Z)', text, re.S)
for result in results:
    print('Q: ' + result[0], 'A: ' + result[1], sep='\n')

The running results are as follows:

Q: I am running Windows XP+Service Pack 2. Why can’t I install the control for entering the card number and password?
A: In Windows XP+Service Pack 2, Windows 2003 and other operating systems, users can choose whether to install controls or not.
Q: Why do I see an * sign in the card number input box?
A: Your browser forbids downloading and executing ActiveX controls. In this case, you must turn on the browser’s ActiveX permissions. Operation method: from the browser menu select “Tools” | “Internet Options”; in the pop-up dialog, select “Security” | “Internet” | “Custom Level”; in the next pop-up dialog, select “Reset to security level”, click the “Reset” button, and confirm.
Q: After reading the above questions, I still cannot log in. What should I do?
A: Your browser cannot install the CMB login control for other reasons. Please download and install the downloadable version of the CMB login control.
Q: The public login page of personal online banking cannot be displayed.
A: This situation is caused by your machine being unable to establish a secure connection with our bank’s server, usually due to a proxy server setup error. If you have dial-up Internet access, do not use a proxy server. If you have installed our SSL secure proxy in the past, please use “Add/Remove Programs” to delete it. If you access the Internet through a proxy, contact your network administrator to set up the proxy server. To set a proxy server in Internet Explorer 5.0, go to Internet Options -> Connections -> LAN Settings -> Use a proxy server -> Advanced.
Q: I always make mistakes when I enter my account number and card number. How should I enter them?
A: The passbook account number is 10 digits; enter it as shown on the passbook. The password is 6 digits. If the one-card number is 12 digits, you only need to enter the 8-digit card number that follows the area code; you do not need to enter the leading 4-digit area code. The password is 6 digits. If the one-card has a 16-digit card number, enter all 16 digits and the 6-digit password.
Q: My passbook does not have a password. How can I check the balance in the popular version of personal online banking?
A: A passbook must have a password before it can be queried in the popular version of personal online banking, so please go to the bank of deposit to set a password for your passbook. Note: Online personal banking is the online banking service provided by China Merchants Bank for individual customers. The content on this page is for reference only. For some services, refer to the announcements and specific regulations of local branches.

Here we use the lookahead (?=...) with two alternatives: the next Q: marker, or the end of the string, \Z. The lookahead checks that the end marker is present without consuming it. The search position therefore does not move past the next Q: , yet each match still has a well-defined end, so every pair is found in full.
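As a rough check of this behaviour, the sketch below runs the lookahead pattern on a made-up three-pair sample (the string and the name pairs are assumptions for illustration) and prints what follows each match.

```python
import re

sample = 'Q: one A: first Q: two A: second Q: three A: third'
pattern = re.compile(r'Q: (.*?)A: (.*?)(?=Q: |\Z)', re.S)

pairs = []
for m in pattern.finditer(sample):
    pairs.append((m.group(1).strip(), m.group(2).strip()))
    # The lookahead consumes nothing, so the text right after the match
    # still begins with the next "Q: " (or is empty at the end of the string)
    print(pairs[-1], '-> followed by:', repr(sample[m.end():m.end() + 3]))

print(pairs)   # [('one', 'first'), ('two', 'second'), ('three', 'third')]
```

Every match stops just before the next marker, so the following pair keeps its leading Q: and is matched as well.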

Zero-width assertions

A zero-width assertion, as the name implies, matches with zero width: the text it tests is not included in the match result. The expression merely describes a position, for example what must follow, or precede, a given point in the string.

We used ?= above, which is one kind of zero-width assertion. The others are ?<=, ?! and ?<!.

  • ?= is the zero-width positive lookahead assertion: it asserts that the text after the current position matches the enclosed expression.
  • ?<= is the zero-width positive lookbehind assertion: it asserts that the text before the current position matches the enclosed expression.
  • ?! is the zero-width negative lookahead assertion: it asserts that the text after the current position does not match the enclosed expression.
  • ?<! is the zero-width negative lookbehind assertion: it asserts that the text before the current position does not match the enclosed expression.
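The four assertions can be compared side by side on a tiny string. The following is a minimal sketch; the sample string and the helper starts() are made up for illustration.

```python
import re

sample = 'ab cb ad'

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

print(starts(r'a(?=b)', sample))    # [0]  "a" followed by "b"
print(starts(r'a(?!b)', sample))    # [6]  "a" not followed by "b"
print(starts(r'(?<=a)b', sample))   # [1]  "b" preceded by "a"
print(starts(r'(?<!a)b', sample))   # [4]  "b" not preceded by "a"

# The asserted text is never part of the match itself
print(re.search(r'a(?=b)', sample).group())   # a
```

In each case the assertion only filters positions; the matched text is the single character outside the parentheses.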

?=

First, let’s look at ?=, which asserts that the text after the current position matches the enclosed expression.

Let’s say we have a string like this:

str = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'

Here we want to extract the statement “My personal email is ...” from the string. Without a zero-width assertion, we need to add an end marker after the statement, or match the email address itself separately. We might write:

import re
str = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
# ", my personal blog" is an ordinary end marker, so it becomes part of the match
result = re.search(r'My personal email is (.*?), my personal blog', str)
print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')

At the end of the regular expression we add “, my personal blog” as the end marker, and we match the email part with a non-greedy group. Let’s look at the result:

Whole sentence result: My personal email is [email protected], my personal blog
First match result: [email protected]

The first group successfully captures the email address, but the whole-match result is not ideal: it includes the end marker we added instead of being a clean sentence.

If we use ?= instead, the result will not contain the marker. The code can be rewritten as follows:

import re
str = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
# The lookahead requires ", my personal blog" to follow but leaves it out of the match
result = re.search(r'My personal email is (.*?)(?=, my personal blog)', str)
print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')

Here we have changed the end marker to (?=, my personal blog), so this part is matched with zero width: the match must be followed by “, my personal blog”, but that text does not appear in the match result.

The running results are as follows:

Whole sentence result: My personal email is [email protected]
First match result: [email protected]

You can see that there are no useless suffix characters in the result of the whole sentence.
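The difference between a consumed end marker and an asserted one also shows up in the match span. Here is a small sketch on a made-up string (the sample and the names consuming and asserting are assumptions for illustration).

```python
import re

s = 'name: Alice; age: 30'   # hypothetical sample

consuming = re.search(r'name: (.*?); age', s)
asserting = re.search(r'name: (.*?)(?=; age)', s)

print(consuming.group(), consuming.span())   # name: Alice; age (0, 16)
print(asserting.group(), asserting.span())   # name: Alice (0, 11)

# The captured group is the same either way; only the overall match differs
print(consuming.group(1) == asserting.group(1))   # True
```

If only the captured group matters, both forms give the same answer; the lookahead matters when the whole match, or the position where the next search starts, is important.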

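?<=

Next, let’s look at ?<=, which asserts that the text before the current position matches the enclosed expression. It works like ?=, but it tests the preceding text. We modify the previous example as follows:

import re
str = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
# Require a comma before the match and a comma after it, without including either
result = re.search(r'(?<=, )my personal blog is (.*?)(?=, )', str)
print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')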

Here we place a zero-width assertion for the comma before “my personal blog” using ?<=, and another for the comma at the end of the statement using ?=, so neither marker is included in the match. The result is as follows:

Whole sentence result: my personal blog is cuiqingcai.com
First match result: cuiqingcai.com

You can see that the whole-match result is a clean statement, without the commas on either side.
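One practical detail worth knowing when writing lookbehinds in Python: the built-in re module only accepts lookbehind patterns of fixed width. The sketch below illustrates this on a made-up string (the sample and the flag rejected are assumptions for illustration).

```python
import re

s = 'key=value, other=thing'

# A fixed-width lookbehind compiles and matches
fixed = re.search(r'(?<=key=)\w+', s)
print(fixed.group())   # value

# A variable-width lookbehind is rejected by the built-in re module
try:
    re.compile(r'(?<=\w+=)\w+')
    rejected = False
except re.error as exc:
    rejected = True
    print('rejected:', exc)
```

The “, ” used in the example above has a fixed width of two characters, so it is accepted.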

?!

?! is the zero-width negative lookahead assertion: it asserts that the text after the current position does not match the enclosed expression. Like ?=, it tests the text that follows, but inverted: it specifies what must not come next. We modify the previous example as follows:

import re
str = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
# Must be followed by ", my personal blog" and not by ", my personal official account"
result = re.search(r'My personal email is (.*?)(?!, my personal official account)(?=, my personal blog)', str)
print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')

Here the end marker is still (?=, my personal blog), but we also use ?! to name another marker, “, my personal official account”. Together they require that the match be followed by “, my personal blog” and not by “, my personal official account”. The result is as follows:

Whole sentence result: My personal email is [email protected]
First match result: [email protected]
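In the example above the negative lookahead does not change the outcome, because the positive one already pins down what follows. A case where ?! does the filtering on its own might look like this (the sample string is made up for illustration):

```python
import re

s = 'item1 ok, item2 failed, item3 ok'   # hypothetical sample

# Keep items that are NOT followed by " failed"
passed = re.findall(r'item\d(?! failed)', s)
print(passed)   # ['item1', 'item3']

# The positive counterpart selects exactly the complement
failed = re.findall(r'item\d(?= failed)', s)
print(failed)   # ['item2']
```

The item names have a fixed shape here, so the engine cannot sidestep the negative assertion by matching a shorter prefix.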

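?<!

?<! is the zero-width negative lookbehind assertion: it asserts that the text before the current position does not match the enclosed expression. Like ?<=, it tests the preceding text, but inverted: it specifies what must not come before. For example, the previous example could be modified as follows:

import re
str = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
# Must be preceded by ", " and must not be preceded by ". "
result = re.search(r'(?<!\. )(?<=, )my personal blog is (.*?)(?=, )', str)
print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')

Here the start marker is still (?<=, ), and ?<! names another marker, a period followed by a space, that must not appear before the match.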

The running results are as follows:

Whole sentence result: my personal blog is cuiqingcai.com
First match result: cuiqingcai.com
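A case where the negative lookbehind does the selecting by itself is picking out numbers that are not prices. This is a small sketch on a made-up string (the sample and the names plain and priced are assumptions for illustration).

```python
import re

s = 'price $40, quantity 12, total $480'   # hypothetical sample

# Numbers NOT preceded by "$"; \b stops a match from starting mid-number
plain = re.findall(r'(?<!\$)\b\d+', s)
print(plain)    # ['12']

# The positive lookbehind picks the complement
priced = re.findall(r'(?<=\$)\d+', s)
print(priced)   # ['40', '480']
```

Without the \b, the pattern could start at the second digit of “40”, which is preceded by “4” rather than “$”, so the word boundary is part of what makes the negative assertion effective.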

Common usage

In the examples above we used the search() method and looked mostly at the whole-match result. That is not the most common usage, because we usually care more about the contents of the captured groups. In practice we more often use the findall() method to match multiple results, as in our opening example. Taking the same string again, the following code outputs the personal email, blog and official account:

import re
str = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
# Each item ends at the next ", " or at the end of the string
results = re.findall(r'personal (.*?) is (.*?)(?=, |\Z)', str)
for result in results:
    print(result[0] + ': ' + result[1])

Here we match the word “personal”, then a non-greedy group, then “is”, then another non-greedy group. The key is the end marker: written as a zero-width assertion, it ends each match without consuming the separator, and all three results are found. Its content is , |\Z, meaning a comma followed by a space, or the end of the string.

The running results are as follows:

email: [email protected]
blog: cuiqingcai.com
official account: Attack Coder

In this way, we successfully output the email, blog and official account, and the matching is smooth and convenient.
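Since findall() returns a list of two-element tuples here, the result can be turned straight into a dictionary. The sketch below assumes a placeholder address, user@example.com, in place of the real one, and the names s and info are made up for illustration.

```python
import re

# Same sentence structure as above; the email address is a placeholder
s = ('My personal email is user@example.com, my personal blog is cuiqingcai.com, '
     'my personal official account is Attack Coder')

# Each (label, value) tuple becomes one dictionary entry
info = dict(re.findall(r'personal (.*?) is (.*?)(?=, |\Z)', s))
print(info['blog'])               # cuiqingcai.com
print(info['official account'])   # Attack Coder
```

This keeps the labels and values together, which is usually more convenient than indexing into tuples.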

Conclusion

After this section, you should have a general understanding of the basic usage and application scenarios of zero-width assertions in regular expressions. With zero-width assertions understood, regular expression matching should feel much more manageable.

This article was first published on Cui Qingcai’s personal blog, Jingmi: Python3 Web Crawler Development in Practice tutorial

For more on web crawlers, follow my personal WeChat official account: Attack Coder
