When learning regular expressions, lookaround is often a source of frustration for beginners. But if you get the gist, the confusion will disappear in no time.

Lookaround is actually divided into two parts: lookahead and lookbehind.

Note: the terms used here reflect my own understanding; other sources may use different ones.

The introduction

When we learn something new, it helps to think about where it is used. Without a clear purpose, what we learn rarely sinks in deeply.

Consider this scenario: When we perform user login or registration verification, we will check whether the password entered by the user is valid. For example, we have the following requirements for the password entered by the user:

  • At least 6 characters
  • Contains at least one lowercase letter
  • Contains at least one uppercase letter
  • Contains at least one digit

Without regular expressions, most people would probably use if statements to check each requirement one by one. Most of us wrote plenty of code like this while learning. There is nothing wrong with it, but it is not very elegant.
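For comparison, the if-statement approach might look like the following minimal sketch. The function name is_valid_password_naive is hypothetical, and the sketch assumes that "letter" and "digit" mean ASCII characters.

```python
import string


def is_valid_password_naive(password: str) -> bool:
    """Check each requirement separately, without regular expressions."""
    # Requirement 1: at least 6 characters.
    if len(password) < 6:
        return False
    # Requirement 2: at least one lowercase letter (ASCII assumed).
    if not any(ch in string.ascii_lowercase for ch in password):
        return False
    # Requirement 3: at least one uppercase letter (ASCII assumed).
    if not any(ch in string.ascii_uppercase for ch in password):
        return False
    # Requirement 4: at least one digit.
    if not any(ch in string.digits for ch in password):
        return False
    return True


print(is_valid_password_naive("Abc123"))  # True
print(is_valid_password_naive("abc123"))  # False: no uppercase letter
```

It works, but every new requirement means another branch.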

So is there a better solution? If you are not yet comfortable with regular expressions, you may run into the following problem: how do you determine whether a string contains at least one uppercase letter, one lowercase letter, and one digit? Once you start writing, you will find it hard to come up with an expression that does not depend on the order of the characters.

A brief explanation of lookaround

So how do we solve this problem? How can a single regular expression determine whether a password contains at least one uppercase letter, one lowercase letter, and one digit, regardless of the order of the characters?

This brings us to the two parts of lookaround we’ll cover next: Lookahead and lookbehind.

When a lookahead or lookbehind is evaluated, the regular expression engine does not advance through the string: it only checks what comes after or before the current position. This means we can use the technique to test in advance whether the string satisfies certain conditions.
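A small sketch with Python's re module (chosen here only for illustration) shows the difference between a pattern that consumes text and a lookahead that only checks it. The string foobar is an arbitrary example.

```python
import re

text = "foobar"

# A plain pattern consumes the characters it matches.
consumed = re.match(r"foo", text)
print(consumed.span())  # (0, 3)

# A lookahead only checks what follows the current position.
# The match is empty and the position stays at 0.
peeked = re.match(r"(?=foo)", text)
print(peeked.span())  # (0, 0)
print(repr(peeked.group()))  # ''
```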

So, before we go any further, let’s take a look at four ways to write lookaround, assuming you already know basic regular expression syntax.

Syntax      Name                 What it does
(?=foo)     Lookahead            Asserts that the text immediately following the current position is foo
(?!foo)     Negative lookahead   Asserts that the text immediately following the current position is not foo
(?<=foo)    Lookbehind           Asserts that the text immediately preceding the current position is foo
(?<!foo)    Negative lookbehind  Asserts that the text immediately preceding the current position is not foo

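Note: foo above can be replaced with a regular expression, which makes lookaround even more powerful. Be aware that some engines, Python's re module among them, only accept fixed-width patterns inside a lookbehind.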

A simple example of lookaround

If the introduction above was hard to follow, a simple example should make clear what the four lookaround constructs do. Assume the string is foobarbarfoo:

Example       Description
bar(?=bar)    Matches the first bar (because the first bar is immediately followed by a bar)
bar(?!bar)    Matches the second bar (because the second bar is not immediately followed by a bar)
(?<=foo)bar   Matches the first bar (because the first bar is immediately preceded by foo)
(?<!foo)bar   Matches the second bar (because the second bar is not immediately preceded by foo)

In each example above, the key word is “immediately”: each assertion is evaluated relative to the bar outside the parentheses. It checks whether the text directly after or directly before that bar meets the requirement.
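The four examples can be checked with a short Python sketch. The helper match_starts is a hypothetical name introduced here. It returns the index at which each match begins: in foobarbarfoo the first bar starts at index 3 and the second at index 6.

```python
import re

text = "foobarbarfoo"


def match_starts(pattern: str, s: str) -> list[int]:
    """Return the start index of every match of pattern in s."""
    return [m.start() for m in re.finditer(pattern, s)]


print(match_starts(r"bar(?=bar)", text))   # [3]: first bar, followed by bar
print(match_starts(r"bar(?!bar)", text))   # [6]: second bar, followed by foo
print(match_starts(r"(?<=foo)bar", text))  # [3]: first bar, preceded by foo
print(match_starts(r"(?<!foo)bar", text))  # [6]: second bar, preceded by bar
```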

Solving the problem

Now that we’ve covered the concept of lookaround and a simple example, let’s go back to the example we started with: how do we use a regular expression to determine whether a password is valid?

Let’s tackle the requirements step by step. First up: at least 6 characters. This one is easy: make sure the password consists only of uppercase letters, lowercase letters, and digits, and is at least 6 characters long.

^[A-Za-z0-9]{6,}$

^ matches the beginning of the string, $ matches the end, [A-Za-z0-9] matches uppercase letters, lowercase letters, and digits, and {6,} requires at least 6 of them.
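As a quick check of this first step, here is a sketch in Python. The constant name BASIC and the sample passwords are made up for illustration.

```python
import re

# Only letters and digits, at least 6 of them, from start to end.
BASIC = re.compile(r"^[A-Za-z0-9]{6,}$")

print(bool(BASIC.search("abc123")))   # True
print(bool(BASIC.search("abc12")))    # False: only 5 characters
print(bool(BASIC.search("abc 123")))  # False: the space is not allowed
```

One Python-specific caveat: $ also matches just before a trailing newline, so "abc123\n" is accepted by this pattern. Using re.fullmatch, or \Z in place of $, is stricter.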

Now the second requirement: contain a lowercase letter. The earlier examples all stressed that a lookaround checks what comes immediately after or before the current position, so some readers may wonder how it can express a requirement that does not care about order. In fact it can: we put a pattern inside the lookahead rather than a literal string. Modifying the original regular expression to meet the new requirement gives:

^(?=.*[a-z])[A-Za-z0-9]{6,}$

Instead of a literal string, the lookahead contains an expression: .*[a-z]. Here .* matches zero or more characters and [a-z] matches a lowercase letter, so the lookahead asserts that the text from the current position onward contains at least one lowercase letter. Note that this is only a check: the regular expression does not advance through the string while evaluating the lookahead. If no character meets the requirement, the match fails right there.
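The following Python sketch illustrates both points: the lookahead finds a lowercase letter anywhere in the string, and it leaves the position untouched. The names HAS_LOWER and probe are hypothetical.

```python
import re

HAS_LOWER = re.compile(r"^(?=.*[a-z])[A-Za-z0-9]{6,}$")

print(bool(HAS_LOWER.search("ABCDEf1")))  # True: the f can be anywhere
print(bool(HAS_LOWER.search("aBCDEF1")))  # True
print(bool(HAS_LOWER.search("ABCDEF1")))  # False: no lowercase letter

# The lookahead alone matches an empty string at position 0,
# so the rest of the pattern still starts from the beginning.
probe = re.match(r"^(?=.*[a-z])", "ABCDEf1")
print(probe.span())  # (0, 0)
```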

The lookahead above is greedy. It can also be written lazily as (?=.*?[a-z]), that is, with a ? after the *, so that the lookahead succeeds as soon as the first lowercase letter appears and matching continues with the rest of the pattern. If greedy and lazy matching are unfamiliar, you can skip this paragraph; there is a brief explanation at the end of the article.
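Since the lookahead only answers yes or no, the greedy and lazy forms accept exactly the same passwords and differ only in how the engine searches. A small sketch with made-up sample strings:

```python
import re

GREEDY = re.compile(r"^(?=.*[a-z])[A-Za-z0-9]{6,}$")
LAZY = re.compile(r"^(?=.*?[a-z])[A-Za-z0-9]{6,}$")

samples = ["ABCDEf1", "aBCDEF1", "ABCDEF1", "abc"]

# Each pair holds the (greedy, lazy) verdict for one sample.
verdicts = [(bool(GREEDY.search(s)), bool(LAZY.search(s))) for s in samples]
print(verdicts)
# [(True, True), (True, True), (False, False), (False, False)]
```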

The next two requirements: include an uppercase letter and include a number. The principle is the same as the second requirement, and the final implementation is directly given here:

^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[A-Za-z0-9]{6,}$

It is worth noting that the three lookaheads (?=.*[A-Z]), (?=.*[a-z]), and (?=.*[0-9]) are evaluated one after another. As soon as one of them fails, matching stops and the match fails. Because the regular expression does not advance through the string while evaluating them, the three lookaheads can be written in any order.
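To see that the order of the lookaheads does not matter, the sketch below builds the pattern in all six possible orders and compares the results. The names is_valid, build, and the sample passwords are assumptions made for this illustration.

```python
import re
from itertools import permutations

LOOKAHEADS = ["(?=.*[A-Z])", "(?=.*[a-z])", "(?=.*[0-9])"]
BODY = "[A-Za-z0-9]{6,}"


def build(order) -> re.Pattern:
    """Assemble the full pattern with the lookaheads in the given order."""
    return re.compile("^" + "".join(order) + BODY + "$")


PASSWORD = build(LOOKAHEADS)


def is_valid(password: str) -> bool:
    return PASSWORD.search(password) is not None


samples = ["Abc123", "1cbA23", "abc123", "ABC123", "Abcdef", "Ab1", "Abc 123"]
baseline = [is_valid(s) for s in samples]
print(baseline)  # [True, True, False, False, False, False, False]

# Every ordering of the three lookaheads gives the same verdicts.
same_for_all_orders = all(
    [build(order).search(s) is not None for s in samples] == baseline
    for order in permutations(LOOKAHEADS)
)
print(same_for_all_orders)  # True
```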

That solves the problem raised at the beginning of this article with a single regular expression. You can try it out in whatever programming language you like or are using.

This article was inspired by CodeWars’ Regex Password Validation

You can also refer to my implementation for a solution.

Greedy mode and lazy mode

  • Greedy means matching the longest possible string
  • Lazy means matching the shortest possible string

For example, given a string InnoFang.

  • In greedy mode, the regular expression is I.*n and the matched text is InnoFan
  • In lazy mode, the regular expression is I.*?n and the matched text is In

In terms of syntax, the only difference is that lazy mode adds a ? after a quantifier, that is, after a symbol such as *, +, or ? that limits the number of repetitions.
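The InnoFang example can be reproduced with a short Python sketch:

```python
import re

text = "InnoFang"

# Greedy: .* grabs as much as possible, then backs off to the last n.
greedy = re.search(r"I.*n", text).group()
print(greedy)  # InnoFan

# Lazy: .*? grabs as little as possible, stopping at the first n.
lazy = re.search(r"I.*?n", text).group()
print(lazy)  # In
```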

References

  • Mastering Lookahead and Lookbehind
  • Regex lookahead, lookbehind and atomic groups
  • What do ‘lazy’ and ‘greedy’ mean in the context of regular expressions?